##### Data loading #####
library(tidyverse) # provides the pipe and the core wrangling packages
library(dplyr)     # already attached by tidyverse; kept for explicitness
customer = read_csv("olist_data/olist_customers_dataset.csv")
geolocation = read_csv("olist_data/olist_geolocation_dataset.csv")
orders = read_csv("olist_data/olist_orders_dataset.csv")
order_items = read_csv("olist_data/olist_order_items_dataset.csv")
order_payments = read_csv("olist_data/olist_order_payments_dataset.csv")
order_reviews = read_csv("olist_data/olist_order_reviews_dataset.csv")
products = read_csv("olist_data/olist_products_dataset.csv")
sellers = read_csv("olist_data/olist_sellers_dataset.csv")
category = read_csv("olist_data/product_category_name_translation.csv")
Each column of the different CSV files is described below.
The customer dataset contains the following features:
| Feature | Description |
|---|---|
| customer_id | key to the orders dataset. Each order has a unique customer_id. |
| customer_unique_id | unique identifier of a customer. |
| customer_zip_code_prefix | zip code prefix of the customer's location. |
| customer_city | customer city name. |
| customer_state | customer state. |
The geolocation dataset contains the following features:
| Feature | Description |
|---|---|
| geolocation_zip_code_prefix | Zip Code of the location. |
| geolocation_lat | latitude. |
| geolocation_lng | longitude. |
| geolocation_city | city name. |
| geolocation_state | state. |
The order_items dataset contains the following features:
| Feature | Description |
|---|---|
| order_id | order unique identifier. |
| order_item_id | sequential number identifying each item included in the same order. |
| product_id | product unique identifier. |
| seller_id | seller unique identifier. |
| shipping_limit_date | the seller's deadline for handing the order over to the logistics partner. |
| price | item price. |
| freight_value | item freight value (if an order has more than one item, the freight value is split between items). |
The order_payments dataset contains the following features:
| Feature | Description |
|---|---|
| order_id | unique identifier of an order. |
| payment_sequential | a customer may pay for an order with more than one payment method; if so, a sequence number is assigned to each payment. |
| payment_type | method of payment chosen by the customer. |
| payment_installments | number of installments chosen by the customer. |
| payment_value | transaction value. |
The order_reviews dataset contains the following features:
| Feature | Description |
|---|---|
| review_id | unique review identifier. |
| order_id | unique order identifier. |
| review_score | score from 1 to 5 given by the customer in a satisfaction survey. |
| review_comment_title | Comment title from the review left by the customer, in Portuguese. |
| review_comment_message | Comment message from the review left by the customer, in Portuguese. |
| review_creation_date | date on which the satisfaction survey was sent to the customer. |
| review_answer_timestamp | timestamp of the satisfaction survey answer. |
The orders dataset contains the following features:
| Feature | Description |
|---|---|
| order_id | unique identifier of the order. |
| customer_id | key to the customer dataset. Each order has a unique customer_id. |
| order_status | Reference to the order status (delivered, shipped, etc). |
| order_purchase_timestamp | Shows the purchase timestamp. |
| order_approved_at | Shows the payment approval timestamp. |
| order_delivered_carrier_date | timestamp when the order was handed over to the logistics partner. |
| order_delivered_customer_date | Shows the actual order delivery date to the customer. |
| order_estimated_delivery_date | Shows the estimated delivery date that was informed to customer at the purchase moment. |
The products dataset contains the following features:
| Feature | Description |
|---|---|
| product_id | unique product identifier. |
| product_category_name | root category of product, in Portuguese. |
| product_name_lenght | number of characters in the product name ("lenght" is misspelled in the source data). |
| product_description_lenght | number of characters in the product description. |
| product_photos_qty | number of product published photos. |
| product_weight_g | product weight measured in grams. |
| product_length_cm | product length measured in centimeters. |
| product_height_cm | product height measured in centimeters. |
| product_width_cm | product width measured in centimeters. |
The sellers dataset contains the following features:
| Feature | Description |
|---|---|
| seller_id | seller unique identifier. |
| seller_zip_code_prefix | Zip Code of the location of the seller. |
| seller_city | seller city name. |
| seller_state | seller state. |
The category dataset contains the following features:
| Feature | Description |
|---|---|
| product_category_name | category name in Portuguese. |
| product_category_name_english | category name in English. |
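These tables link through shared keys: orders joins customer on customer_id; order_items, order_payments, and order_reviews join orders on order_id; products joins category on product_category_name; and the zip code prefixes connect customer and sellers to geolocation. A minimal sketch of one such join on made-up ids (the `_mini` tables below are hypothetical, not from the data):

```r
library(dplyr)

# Hypothetical miniature orders and customer tables, linked by customer_id
orders_mini <- tibble(order_id = c("o1", "o2"),
                      customer_id = c("c1", "c2"))
customer_mini <- tibble(customer_id = c("c1", "c2"),
                        customer_unique_id = c("u1", "u1"), # same person, two orders
                        customer_state = c("SP", "RJ"))

# customer_id is unique per order, so the join adds columns without adding rows
left_join(orders_mini, customer_mini, by = "customer_id")
```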
##### Summary #####
options(width = 100)
library(funModeling)
df_status(customer)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 customer_id 0 0 0 0 0 0 character 99441
## 2 customer_unique_id 0 0 0 0 0 0 character 96096
## 3 customer_zip_code_prefix 0 0 0 0 0 0 character 14994
## 4 customer_city 0 0 0 0 0 0 character 4119
## 5 customer_state 0 0 0 0 0 0 character 27
df_status(geolocation)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 geolocation_zip_code_prefix 0 0 0 0 0 0 character 19015
## 2 geolocation_lat 0 0 0 0 0 0 numeric 717363
## 3 geolocation_lng 0 0 0 0 0 0 numeric 717615
## 4 geolocation_city 0 0 0 0 0 0 character 8010
## 5 geolocation_state 0 0 0 0 0 0 character 27
df_status(order_items)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 order_id 0 0.00 0 0 0 0 character 98666
## 2 order_item_id 0 0.00 0 0 0 0 numeric 21
## 3 product_id 0 0.00 0 0 0 0 character 32951
## 4 seller_id 0 0.00 0 0 0 0 character 3095
## 5 shipping_limit_date 0 0.00 0 0 0 0 POSIXct/POSIXt 93318
## 6 price 0 0.00 0 0 0 0 numeric 5968
## 7 freight_value 383 0.34 0 0 0 0 numeric 6999
df_status(order_payments)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 order_id 0 0.00 0 0 0 0 character 99440
## 2 payment_sequential 0 0.00 0 0 0 0 numeric 29
## 3 payment_type 0 0.00 0 0 0 0 character 5
## 4 payment_installments 2 0.00 0 0 0 0 numeric 24
## 5 payment_value 9 0.01 0 0 0 0 numeric 29077
df_status(sellers)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 seller_id 0 0 0 0 0 0 character 3095
## 2 seller_zip_code_prefix 0 0 0 0 0 0 character 2246
## 3 seller_city 0 0 0 0 0 0 character 611
## 4 seller_state 0 0 0 0 0 0 character 23
df_status(category)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 product_category_name 0 0 0 0 0 0 character 71
## 2 product_category_name_english 0 0 0 0 0 0 character 71
df_status(orders)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 order_id 0 0 0 0.00 0 0 character 99441
## 2 customer_id 0 0 0 0.00 0 0 character 99441
## 3 order_status 0 0 0 0.00 0 0 character 8
## 4 order_purchase_timestamp 0 0 0 0.00 0 0 POSIXct/POSIXt 98875
## 5 order_approved_at 0 0 160 0.16 0 0 POSIXct/POSIXt 90733
## 6 order_delivered_carrier_date 0 0 1783 1.79 0 0 POSIXct/POSIXt 81018
## 7 order_delivered_customer_date 0 0 2965 2.98 0 0 POSIXct/POSIXt 95664
## 8 order_estimated_delivery_date 0 0 0 0.00 0 0 POSIXct/POSIXt 459
df_status(order_reviews)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 review_id 0 0.00 0 0.00 0 0 character 98410
## 2 order_id 0 0.00 0 0.00 0 0 character 98673
## 3 review_score 0 0.00 0 0.00 0 0 numeric 5
## 4 review_comment_title 13 0.01 87658 88.34 0 0 character 4178
## 5 review_comment_message 1 0.00 58256 58.71 0 0 character 35743
## 6 review_creation_date 0 0.00 0 0.00 0 0 POSIXct/POSIXt 636
## 7 review_answer_timestamp 0 0.00 0 0.00 0 0 POSIXct/POSIXt 98248
df_status(products)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 product_id 0 0.00 0 0.00 0 0 character 32951
## 2 product_category_name 0 0.00 610 1.85 0 0 character 73
## 3 product_name_lenght 0 0.00 610 1.85 0 0 numeric 66
## 4 product_description_lenght 0 0.00 610 1.85 0 0 numeric 2960
## 5 product_photos_qty 0 0.00 610 1.85 0 0 numeric 19
## 6 product_weight_g 4 0.01 2 0.01 0 0 numeric 2204
## 7 product_length_cm 0 0.00 2 0.01 0 0 numeric 99
## 8 product_height_cm 0 0.00 2 0.01 0 0 numeric 102
## 9 product_width_cm 0 0.00 2 0.01 0 0 numeric 95
Missing values come mainly from the orders, order_reviews, and products datasets.
library(naniar)
orders %>% vis_miss()
order_reviews %>% vis_miss()
products %>% vis_miss()
variable_set=list(customer,geolocation,orders,order_items,order_payments,order_reviews,products,sellers,category)
n_colums <- map(variable_set,ncol) %>% unlist()
# count the missing values of each column
# c() flattens the list of data frames into a single list of columns
variable_set=c(customer,geolocation,orders,order_items,order_payments,order_reviews,products,sellers,category)
missing_m <- variable_set%>%
map_int(~ sum(is.na(.))) %>%
as_tibble()
n <- which(missing_m>0) # indices (over all columns) with missing values
c <- c() # collects the source-table label of each such column
n_rows <- variable_set %>% map(.,length) # each element is a column, so length() gives its row count
n_rows <- n_rows[n] %>% unlist()
for (i in 1:length(n)) {
if (n[i]<=n_colums[1]){
print("customer")
c <- c(c,"customer")
}
else if (n_colums[1]<n[i] && n[i]<=sum(n_colums[1:2])){
print("geolocation")
c <- c(c,"geolocation")
}
else if (sum(n_colums[1:2])<n[i] && n[i]<=sum(n_colums[1:3])){
print("orders")
c <- c(c,"orders")
}
else if (sum(n_colums[1:3])<n[i] && n[i]<=sum(n_colums[1:4])){
print("order_items")
c <- c(c,"order_items")
}
else if (sum(n_colums[1:4])<n[i] && n[i]<=sum(n_colums[1:5])){
print("order_payments")
c <- c(c,"order_payments")
}
else if (sum(n_colums[1:5])<n[i] && n[i]<=sum(n_colums[1:6])){
print("order_reviews")
c <- c(c,"order_reviews")
}
else if (sum(n_colums[1:6])<n[i] && n[i]<=sum(n_colums[1:7])){
print("products")
c <- c(c,"products")
}
else if (sum(n_colums[1:7])<n[i] && n[i]<=sum(n_colums[1:8])){
print("sellers")
c <- c(c,"sellers")
}
else if (sum(n_colums[1:8])<n[i] && n[i]<=sum(n_colums[1:9])){
print("category")
c <- c(c,"category")
}
}
## [1] "orders"
## [1] "orders"
## [1] "orders"
## [1] "order_reviews"
## [1] "order_reviews"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
## [1] "products"
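The index arithmetic in the loop above can be avoided by keeping the data frames in a named list, so each column carries its source-table label directly. A sketch assuming the data frames loaded at the top; the name `datasets` is ours, not from the original code:

```r
library(dplyr)
library(purrr)

# Named list: the names become the labels, replacing the cumulative-index loop
datasets <- list(customer = customer, geolocation = geolocation, orders = orders,
                 order_items = order_items, order_payments = order_payments,
                 order_reviews = order_reviews, products = products,
                 sellers = sellers, category = category)

missing_by_col <- imap_dfr(datasets, ~ tibble(
  label = .y,                  # source table name
  variable = names(.x),
  count = colSums(is.na(.x)),  # missing values per column
  all_count = nrow(.x)
)) %>%
  filter(count > 0) %>%
  mutate(ratio = round(count / all_count, 3))
```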
# collect the missing columns and their total row counts
c <- c %>%as_tibble() %>% count(value) %>% as.data.frame()
missing_m <- variable_set%>%
map_int(~ sum(is.na(.))) %>%
.[which(.>0)]
n <- missing_m %>%
names() %>% as.data.frame()
missing_m <- missing_m %>%
as_tibble() %>%
cbind(n,.) %>%
as.data.frame() %>%
mutate(all_count=n_rows)
names(missing_m) <- c("variable","count","all_count")
missing_m <- missing_m %>%
mutate(label=c(rep("orders",3),rep("order_reviews",2),rep("products",8)),
ratio=round(count/all_count, digits =3),
percent=round((count/all_count)*100, digits =3))
## Visualize the missing values
library(colorspace)
library(patchwork) # enables p1 + p2 layouts
p1 <- missing_m %>%
ggplot(aes(x = reorder(variable,count),y = count,color=label,fill=label)) +
geom_col()+
coord_flip()+ # flip the x/y axes
xlab("variable")+
theme(legend.position = "none")+
ggtitle("Number of missing values")+ # plot title
geom_text(aes(label = count, # print each bar's count
hjust =ifelse(count > 6000, 1.5, -0.3)),
size =3.5)+
scale_fill_discrete_sequential(palette = "Viridis", # bar colors
nmax = 6,
rev = FALSE,
order = c(2,3,5),
name="Data from:")+
scale_color_discrete_sequential(palette = "Grays", # label text colors
nmax = 3,
rev = FALSE,
order = c(3,1,1),
name="Data from:")
p2 <- missing_m %>%
ggplot(aes(x = reorder(variable,ratio),
y = ratio,
color=label,
fill=label)) +
geom_col()+
coord_flip()+
xlab("variable")+
theme(legend.position = c(0.75,0.15),
axis.text.y = element_blank())+
ggtitle("Ratio of missing values")+
geom_text(aes(label = scales::percent(ratio),
hjust =ifelse(ratio > 0.5, 1.5, -0.3)),
size =3.5)+
scale_y_continuous(labels = scales::percent)+
scale_fill_discrete_sequential(palette = "Viridis",
nmax = 6,
rev = FALSE,
order = c(2,3,5),
name="Data from:")+
scale_color_discrete_sequential(palette = "Grays",
nmax = 3,
rev = FALSE,
order = c(3,1,1),
name="Data from:")
p1+p2
1. In order_reviews, the review_comment_title and review_comment_message
columns are each missing more than 50% of their values, so we drop these columns outright.
2. orders and products have missing rates below 5%, so we simply drop the affected rows.
##### Removing missing values #####
library(mice)
library(VIM)
# 1. Drop columns from order_reviews
# (missing_m$ratio has no names, so names(which(...)) would return NULL; index the variable column instead)
c1 <- missing_m$variable[missing_m$ratio > 0.5] # columns missing more than 50%
location1 <- grep(c1[1],names(order_reviews))
location2 <- grep(c1[2],names(order_reviews))
order_reviews1 <- order_reviews[,c(-location1,-location2)]
# 2. Drop rows from products and orders
# complete.cases():
# returns TRUE for a row with no missing values, FALSE for a row with any missing value
products1 <- products[complete.cases(products),]
orders1 <- orders[complete.cases(orders),]
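complete.cases() behaves exactly as the comment describes; a self-contained toy check:

```r
# A row is complete only if none of its values are NA
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
complete.cases(df)        # TRUE FALSE FALSE
df[complete.cases(df), ]  # keeps only the first row
```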
# 3. Confirm no missing values remain
orders1 %>% vis_miss()
order_reviews1 %>% vis_miss()
products1 %>% vis_miss()
orders1 %>%
filter(if_any(everything(), is.na))
## # A tibble: 0 × 8
## # ℹ 8 variables: order_id <chr>, customer_id <chr>, order_status <chr>,
## # order_purchase_timestamp <dttm>, order_approved_at <dttm>, order_delivered_carrier_date <dttm>,
## # order_delivered_customer_date <dttm>, order_estimated_delivery_date <dttm>
order_reviews1 %>%
filter(if_any(everything(), is.na))
## # A tibble: 0 × 5
## # ℹ 5 variables: review_id <chr>, order_id <chr>, review_score <dbl>, review_creation_date <dttm>,
## # review_answer_timestamp <dttm>
products1 %>%
filter(if_any(everything(), is.na))
## # A tibble: 0 × 9
## # ℹ 9 variables: product_id <chr>, product_category_name <chr>, product_name_lenght <dbl>,
## # product_description_lenght <dbl>, product_photos_qty <dbl>, product_weight_g <dbl>,
## # product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>
##### Derived date features for orders / order_items2 #####
# wday() returns the day-of-week index; by default Sunday = 1 and Monday = 2,
# but with week_start = 1 (used below) Monday = 1.
#orders
orders2 <- orders1 %>%
mutate(
purchase_timestamp_year = year(order_purchase_timestamp),
purchase_timestamp_month = month(order_purchase_timestamp),
purchase_timestamp_day = mday(order_purchase_timestamp),
purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
purchase_timestamp_hour = hour(order_purchase_timestamp)
)
orders02 <- orders %>%
mutate(
purchase_timestamp_year = year(order_purchase_timestamp),
purchase_timestamp_month = month(order_purchase_timestamp),
purchase_timestamp_day = mday(order_purchase_timestamp),
purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
purchase_timestamp_hour = hour(order_purchase_timestamp)
)
#order_items
order_items2 <- order_items %>%
mutate(
shipping_limit_year = year(shipping_limit_date),
shipping_limit_month = month(shipping_limit_date),
shipping_limit_day = mday(shipping_limit_date),
shipping_limit_week = wday(shipping_limit_date, week_start = 1),
shipping_limit_hour = hour(shipping_limit_date)
)
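The wday() convention can be verified on a fixed date; 2017-11-24 (Black Friday, which reappears later in this analysis) fell on a Friday:

```r
library(lubridate)

d <- as.Date("2017-11-24")  # a Friday
wday(d)                     # default: Sunday = 1, so Friday = 6
wday(d, week_start = 1)     # Monday = 1 (as used above), so Friday = 5
```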
##### Join orders and customer on customer_id; result: orders_customer2 #####
# orders2 lost 2,980 observations when its missing rows were dropped, so joining it lost data;
# we therefore join the raw orders to customer first, then drop the incomplete rows (missing ≈ 0.4% of cells)
n_distinct(orders$customer_id) == nrow(orders) # TRUE
## [1] TRUE
n_distinct(customer$customer_id) == nrow(customer) # TRUE
## [1] TRUE
orders_customer <-
left_join(orders, customer, by = "customer_id")
sum(is.na(orders_customer)) # sum != 0
## [1] 4908
ptp <- orders_customer %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for orders_customer")
orders_customer2 <- orders_customer[complete.cases(orders_customer),]
sum(is.na(orders_customer2)) #sum=0
## [1] 0
nrow(orders_customer2)
## [1] 96461
orders_customer3 <- orders_customer2 %>%
mutate(
purchase_timestamp_year = year(order_purchase_timestamp),
purchase_timestamp_month = month(order_purchase_timestamp),
purchase_timestamp_day = mday(order_purchase_timestamp),
purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
purchase_timestamp_hour = hour(order_purchase_timestamp)
)
##### Join orders_customer2 and order_items2 on order_id; result: orders_customer2_items2 #####
n_distinct(order_items2$order_id) == nrow(order_items2)
## [1] FALSE
n_distinct(orders_customer2$order_id) == nrow(orders_customer2)
## [1] TRUE
orders_customer2_items2 <-
left_join(orders_customer2, order_items2 , by = "order_id")
sum(is.na(orders_customer2_items2)) #sum=0
## [1] 0
nrow(orders_customer2_items2)
## [1] 110180
##### Join products and category on product_category_name; results: products_new, products_new1 #####
n_distinct(category$product_category_name) == nrow(category) # TRUE
## [1] TRUE
n_distinct(products$product_category_name) == nrow(products) # FALSE
## [1] FALSE
products_new <-
left_join(products, category, by = "product_category_name")
sum(is.na(products_new)) #sum!=0
## [1] 3071
sum(is.na(products$product_category_name)) #sum!=0
## [1] 610
ptp <- products_new %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for products_new")
products_new1 <- products_new[complete.cases(products_new),]
sum(is.na(products_new1)) #sum=0
## [1] 0
nrow(products_new1 )
## [1] 32327
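The 3,071 NAs in products_new are consistent with how left_join() behaves: the 610 products with a missing category name each gain one more NA in the translated column, and the few categories absent from the translation table (products has 73 distinct categories, category only 71) account for the rest. A toy illustration with made-up rows:

```r
library(dplyr)

# Hypothetical products: one NA key, one key missing from the lookup table
prods <- tibble(product_id = c("p1", "p2", "p3"),
                product_category_name = c("beleza_saude", NA, "categoria_x"))
lookup <- tibble(product_category_name = "beleza_saude",
                 product_category_name_english = "health_beauty")

left_join(prods, lookup, by = "product_category_name")
# p2 (NA key) and p3 (unmatched key) both get NA in the English column
```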
##### Join order_items2 and products_new on product_id; result: items2_products1 #####
n_distinct(order_items2$product_id) == nrow(order_items2)
## [1] FALSE
n_distinct(products_new1$product_id) == nrow(products_new1)
## [1] TRUE
items2_products <-
left_join(order_items2,products_new , by = "product_id")
sum(is.na(items2_products)) # sum != 0; products itself has missing values
## [1] 8111
ptp <- items2_products %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for items2_products")
items2_products1 <- items2_products[complete.cases(items2_products),]
sum(is.na(items2_products1)) #sum=0
## [1] 0
nrow(items2_products1)
## [1] 111022
##### Join order_items2 and orders on order_id; result: items2_orders #####
items2_orders<- left_join( order_items2,orders, by = "order_id")
sum(is.na(items2_orders)) # sum != 0
## [1] 3663
library(naniar) # missing-value visualization (already attached above)
ptp <-items2_orders %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for order_items2_orders")
# keep only the complete rows
items2_orders<-items2_orders[complete.cases(items2_orders),]
sum(is.na(items2_orders)) #sum=0
## [1] 0
nrow(items2_orders)
## [1] 110180
items2_orders2 <- items2_orders %>%
mutate(
purchase_timestamp_year = year(order_purchase_timestamp),
purchase_timestamp_month = month(order_purchase_timestamp),
purchase_timestamp_day = mday(order_purchase_timestamp),
purchase_timestamp_week = wday(order_purchase_timestamp, week_start = 1),
purchase_timestamp_hour = hour(order_purchase_timestamp),
purchase_timestamp_month1 = format(order_purchase_timestamp, "%Y-%m")
)
##### geolocation: draw the Brazil map (done) #####
names(customer)[3] <- "zip_code_prefix"
names(sellers)[2] <- "zip_code_prefix"
names(geolocation)[1] <- "zip_code_prefix"
# geolocation has many rows per zip prefix, so aggregate it by prefix
geolocation_group <- geolocation %>% group_by(zip_code_prefix) %>%
summarise(
geolocation_lat = mean(geolocation_lat),
geolocation_lng = mean(geolocation_lng),
geolocation_city = names(sort(table(geolocation_city), decreasing = TRUE))[1], # most frequent city name
geolocation_state = names(sort(table(geolocation_state), decreasing = TRUE))[1] # most frequent state
)
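When several spellings of a city share one zip prefix, keeping the most frequent spelling per group is a reasonable collapse rule (note that `names(table(x))[1]` would instead pick the alphabetically first value). A self-contained sketch on made-up rows:

```r
library(dplyr)

# One zip prefix, two spellings of the same city
g <- tibble(zip_code_prefix = c("01037", "01037", "01037"),
            geolocation_city = c("sao paulo", "sao paulo", "são paulo"))

g %>%
  group_by(zip_code_prefix) %>%
  summarise(city = names(sort(table(geolocation_city), decreasing = TRUE))[1])
# keeps "sao paulo", the majority spelling
```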
# customer_zip: join customer and geolocation_group on zip_code_prefix; result: customer_zip1
n_distinct(customer$zip_code_prefix) == nrow(customer)
## [1] FALSE
n_distinct(geolocation_group$zip_code_prefix) == nrow(geolocation_group)
## [1] TRUE
sum(is.na(customer))
## [1] 0
sum(is.na(geolocation_group))
## [1] 0
customer_zip <- left_join(customer, geolocation_group, by = "zip_code_prefix")
sum(is.na(customer_zip)) #sum!=0
## [1] 1112
ptp <- customer_zip %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for customer_zip")
customer_zip1 <- customer_zip[complete.cases(customer_zip),]
sum(is.na(customer_zip1)) #sum=0
## [1] 0
nrow(customer_zip1)
## [1] 99163
# sellers_zip: join sellers and geolocation_group on zip_code_prefix; result: sellers_zip1
n_distinct(sellers$zip_code_prefix) == nrow(sellers)
## [1] FALSE
n_distinct(geolocation_group$zip_code_prefix) == nrow(geolocation_group)
## [1] TRUE
sum(is.na(sellers))
## [1] 0
sum(is.na(geolocation_group))
## [1] 0
sellers_zip <- left_join(sellers, geolocation_group, by = "zip_code_prefix")
sum(is.na(sellers_zip)) #sum!=0
## [1] 28
ptp <- sellers_zip %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for sellers_zip")
sellers_zip1 <- sellers_zip[complete.cases(sellers_zip),]
sum(is.na(sellers_zip1)) #sum=0
## [1] 0
nrow(sellers_zip1)
## [1] 3088
##### Join orders02 and customer_zip on customer_id; result: orders_customer_distance #####
# orders_customer_distance1 is the cleaned, usable data
n_distinct(customer_zip$customer_id) == nrow(customer_zip)
## [1] TRUE
n_distinct(orders02$customer_id) == nrow(orders02)
## [1] TRUE
sum(is.na(customer_zip))
## [1] 1112
sum(is.na(orders02))
## [1] 4908
orders_customer_distance <- left_join(orders02, customer_zip, by = "customer_id")
sum(is.na(orders_customer_distance)) #sum!=0
## [1] 6020
ptp <- orders_customer_distance %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for orders_customer_distance")
orders_customer_distance1 <- orders_customer_distance[complete.cases(orders_customer_distance),]
sum(is.na(orders_customer_distance1)) #sum=0
## [1] 0
nrow(orders_customer_distance1)
## [1] 96197
names(orders_customer_distance)[15:21] <- c("customer_zip_code_prefix","customer_city","customer_state","customer_lat","customer_lng","customer_geolocation_city","customer_geolocation_state")
##### Join order_items2 and sellers_zip on seller_id; result: items2_sellers_distance #####
n_distinct(sellers_zip$seller_id) == nrow(sellers_zip)
## [1] TRUE
n_distinct(order_items2$seller_id) == nrow(order_items2)
## [1] FALSE
sum(is.na(sellers_zip))
## [1] 28
sum(is.na(order_items2))
## [1] 0
items2_sellers_distance <- left_join(order_items2, sellers_zip, by = "seller_id")
sum(is.na(items2_sellers_distance)) #sum!=0
## [1] 1012
ptp <- items2_sellers_distance %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for items2_sellers_distance")
items2_sellers_distance1 <- items2_sellers_distance[complete.cases(items2_sellers_distance),]
sum(is.na(items2_sellers_distance1)) #sum=0
## [1] 0
nrow(items2_sellers_distance1)
## [1] 112397
names(items2_sellers_distance)[13:19] <- c("seller_zip_code_prefix","seller_city","seller_state","seller_lat","seller_lng","seller_geolocation_city","seller_geolocation_state")
##### Join orders_customer_distance and items2_sellers_distance on order_id; result: customer_sellers_zip #####
n_distinct(orders_customer_distance$order_id) == nrow(orders_customer_distance)
## [1] TRUE
n_distinct(items2_sellers_distance$order_id) == nrow(items2_sellers_distance)
## [1] FALSE
sum(is.na(orders_customer_distance))
## [1] 6020
sum(is.na(items2_sellers_distance))
## [1] 1012
customer_sellers_zip <- left_join(orders_customer_distance, items2_sellers_distance, by = "order_id")
sum(is.na(customer_sellers_zip)) #sum!=0
## [1] 21544
nrow(customer_sellers_zip)
## [1] 113425
ptp <- customer_sellers_zip %>% vis_miss(warn_large_data = FALSE)
ptp + ggtitle("Missing Values Visualization for customer_sellers_zip")
customer_sellers_zip1 <- customer_sellers_zip[complete.cases(customer_sellers_zip),]
sum(is.na(customer_sellers_zip1)) #sum=0
## [1] 0
nrow(customer_sellers_zip1)
## [1] 109644
#####order_id_cost#####
order_id_cost <- order_items2 %>%
group_by(order_id) %>%
summarise(
items_buy = n(), # number of items in the order
n_distinct_shipping_limit_day = n_distinct(shipping_limit_date), # distinct seller shipping deadlines in the order
n_distinct_product = n_distinct(product_id), # distinct products in the order
product_total_cost = sum(price), # total item price
Fright_total_cost = sum(freight_value), # total freight
Total_cost = sum(price)+sum(freight_value), # total spend
max_cost = max(price), # price of the most expensive item
min_cost = min(price), # price of the cheapest item
avg_Fright = round(mean(freight_value),digits = 2), # average freight
avg_cost_nproduct = round(Total_cost/items_buy,digits = 2) # average spend per item (incl. freight)
)
# Overview
summary(order_id_cost)
## order_id items_buy n_distinct_shipping_limit_day n_distinct_product
## Length:98666 Min. : 1.000 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Median : 1.000 Median :1.000 Median :1.000
## Mean : 1.142 Mean :1.004 Mean :1.038
## 3rd Qu.: 1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :21.000 Max. :3.000 Max. :8.000
## product_total_cost Fright_total_cost Total_cost max_cost min_cost
## Min. : 0.85 Min. : 0.00 Min. : 9.59 Min. : 0.85 Min. : 0.85
## 1st Qu.: 45.90 1st Qu.: 13.85 1st Qu.: 61.98 1st Qu.: 42.20 1st Qu.: 40.00
## Median : 86.90 Median : 17.17 Median : 105.29 Median : 79.90 Median : 79.00
## Mean : 137.75 Mean : 22.82 Mean : 160.58 Mean : 126.63 Mean : 125.29
## 3rd Qu.: 149.90 3rd Qu.: 24.04 3rd Qu.: 176.87 3rd Qu.: 139.90 3rd Qu.: 139.84
## Max. :13440.00 Max. :1794.96 Max. :13664.08 Max. :6735.00 Max. :6735.00
## avg_Fright avg_cost_nproduct
## Min. : 0.00 Min. : 9.34
## 1st Qu.: 13.37 1st Qu.: 57.51
## Median : 16.36 Median : 96.22
## Mean : 20.19 Mean : 146.11
## 3rd Qu.: 21.18 3rd Qu.: 163.03
## Max. :409.68 Max. :6929.31
# Number of items purchased per order
order_id_cost %>% ggplot(aes(x = items_buy))+
geom_bar(fill="#006699",alpha = 0.7)+
xlab("number of items purchased")+
ggtitle("Total items purchased per transaction")+
geom_text(stat = "count",
aes(label = after_stat(count)),
size = 3,vjust = -0.2)+
scale_x_continuous(breaks = seq(0, max(order_id_cost$items_buy), by = 1))
# Spend by number of items purchased
order_id_cost %>% ggplot(aes(x =as.factor(items_buy) ,y = product_total_cost))+
geom_boxplot()+
xlab("number of items purchased")+
ggtitle("Total cost by number of items purchased")
Orders with two or more items: 9,803 (about 9.94%).
order_id_cost[(which(order_id_cost$items_buy>=2)),] %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 9803
cat("ratio:",round(9803/nrow(order_id_cost)*100, digits = 4),"%")
## ratio: 9.9355 %
#####customer_id_cost#####
customer_id_cost <- orders_customer2_items2 %>%
group_by(customer_unique_id,order_id) %>%
summarise(
items_buy = n(), # number of items in the order
n_distinct_shipping_limit_day = n_distinct(shipping_limit_date), # distinct seller shipping deadlines
n_distinct_product = n_distinct(product_id), # distinct products in the order
product_total_cost = sum(price), # total item price
Fright_total_cost = sum(freight_value), # total freight
Total_cost = sum(price)+sum(freight_value), # total spend
max_cost = max(price), # price of the most expensive item
min_cost = min(price), # price of the cheapest item
avg_Fright = round(mean(freight_value),digits = 2), # average freight
avg_cost_nproduct = round(Total_cost/items_buy,digits = 2) # average spend per item (incl. freight)
)
87.6% of customers bought only once; about 9.7% came back to repurchase.
## Number of orders per customer
customer_id_n <- orders_customer2_items2 %>%
count(customer_unique_id, sort = TRUE) %>%
count(n, sort = TRUE) %>%
mutate(ratio = round(nn/sum(nn), digits =3))
customer_id_n %>%
mutate(label1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",nn,")")) %>%
ggplot(aes(x = reorder(n,nn),
y = ratio))+
geom_col(alpha = 0.7)+
geom_col(data = customer_id_n %>% filter(n=="1"),fill="#E69F00")+
xlab("Repurchase frequency")+
ylab("Ratio of people")+
coord_flip()+
ggtitle("The repurchase frequency per customer in the store")+
geom_text(aes(label = label1,
hjust =ifelse(nn > 9000, 1.2, -0.3)),
size =3)+
scale_y_continuous(labels = scales::percent)
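The chained count() used to build customer_id_n is the core of this chart: the first count() tallies orders per customer, and the second tallies how many customers share each order count (its result column is named nn because n is already taken). On toy ids:

```r
library(dplyr)

ids <- tibble(customer_unique_id = c("a", "a", "a", "b", "b", "c"))

ids %>%
  count(customer_unique_id) %>%  # orders per customer: a = 3, b = 2, c = 1
  count(n)                       # customers per order count: one of each
```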
second_purchase <- orders_customer2_items2 %>%
group_by(customer_unique_id) %>%
summarise(
purchase_time = n(), # number of purchase records
first_buy = min(floor_date(order_purchase_timestamp,"day")), # earliest purchase date
recent = sort(floor_date(order_purchase_timestamp,"day"), decreasing = F)[2], # second-earliest purchase date
since = recent-first_buy # time until the second purchase
)
second_purchase1 <- second_purchase %>% filter(since>0) %>%
mutate(
days =round(as.double(since, units="days"),digits =1)
)
cat("mean:",mean(second_purchase1$days),"days \n")
## mean: 116.7948 days
cat("variance:",var(second_purchase1$days),"days")
## variance: 13321.11 days
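The sort(...)[2] idiom above extracts the second-earliest purchase date per customer; its arithmetic on made-up dates:

```r
# Hypothetical purchase dates for one customer
dates <- as.Date(c("2017-01-10", "2017-05-04", "2017-02-01"))

first_buy <- min(dates)   # 2017-01-10
second <- sort(dates)[2]  # 2017-02-01
second - first_buy        # time to the second purchase: 22 days
```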
The average time to a repeat purchase is about 116 days, with the density peaking around 24 days.
purchase_label <- second_purchase1 %>% count(days)
density_est <- density(second_purchase1$days)
peak_days <- density_est$x[which.max(density_est$y)]
second_purchase1 %>% ggplot(aes(x = days))+
geom_density()+
xlab("Days")+
ggtitle("Time interval to the second purchase")+
geom_vline(aes(xintercept = mean(days)), color = "red", linetype = "dashed", linewidth = 0.8)+
geom_vline(aes(xintercept = peak_days), color = "blue", linetype = "dashed", linewidth = 0.8) +
annotate("text", x = mean(second_purchase1$days), y = 0, label = paste("Mean:", round(mean(second_purchase1$days), 2)), vjust = -5,hjust = -0.05, color = "red") +
annotate("text", x = peak_days, y = 0, label = paste("Peak:", round(peak_days, 2)), vjust = -10, hjust = -0.03,color = "blue")+
theme(
plot.title = element_text(size = 15),
axis.text = element_text(size = 10),
axis.title = element_text(size = 10) )
items2_orders2_year <- items2_orders2 %>%
group_by(purchase_timestamp_year,purchase_timestamp_month) %>%
summarise(
orders=n(),
revenue=sum(price)
)
p1 <- items2_orders2_year %>%
filter(purchase_timestamp_year==2016) %>%
ggplot(aes(x =as.factor(purchase_timestamp_month),y = revenue))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("month")+
ylim(0,1000000)+
ggtitle("2016 monthly revenue")
p2 <- items2_orders2_year %>%
filter(purchase_timestamp_year==2017) %>%
ggplot(aes(x =as.factor(purchase_timestamp_month),y = revenue))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("month")+
ggtitle("2017 monthly revenue")
p3 <- items2_orders2_year %>%
filter(purchase_timestamp_year==2018) %>%
ggplot(aes(x =as.factor(purchase_timestamp_month),y = revenue))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("month")+
ggtitle("2018 monthly revenue")
p1+p2+p3
# Order counts
p1 <- items2_orders2_year %>%
filter(purchase_timestamp_year==2016) %>%
ggplot(aes(x =as.factor(purchase_timestamp_month),y = orders))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("month")+
ylim(0,9000)+
ggtitle("2016 monthly orders")
p2 <- items2_orders2_year %>%
filter(purchase_timestamp_year==2017) %>%
ggplot(aes(x =as.factor(purchase_timestamp_month),y = orders))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("month")+
ylim(0,9000)+
ggtitle("2017 monthly orders")
p3 <- items2_orders2_year %>%
filter(purchase_timestamp_year==2018) %>%
ggplot(aes(x =as.factor(purchase_timestamp_month),y = orders))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("month")+
ylim(0,9000)+
ggtitle("2018 monthly orders")
p1+p2+p3
Revenue and order counts grew year over year and then stabilized, with a spike in November 2017.
Looking at November 2017 alone, revenue and order counts on 11/24 are striking: it was Black Friday, a special promotion day, so sales peaked far above ordinary days.
items2_orders2_11<- items2_orders2 %>%
filter(purchase_timestamp_year==2017,purchase_timestamp_month==11) %>%
group_by(purchase_timestamp_day) %>%
summarise(
orders=n(),
revenue=sum(price)
)
p1 <- items2_orders2_11 %>%
ggplot(aes(x =as.factor(purchase_timestamp_day),y = revenue))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("day")+
ggtitle("2017.11 revenue")
p2 <- items2_orders2_11 %>%
ggplot(aes(x =as.factor(purchase_timestamp_day),y = orders))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4)+
xlab("day")+
ggtitle("2017.11 orders")
p1
p2
Daily order counts average about 157.
daily_orders <- orders2 %>%
group_by(purchase_timestamp_year,purchase_timestamp_month,
purchase_timestamp_day,purchase_timestamp_week) %>%
summarise(orders = n())
month_orders <- orders2 %>%
group_by(purchase_timestamp_year,purchase_timestamp_month) %>%
summarise(orders = n())
summary(daily_orders)
## purchase_timestamp_year purchase_timestamp_month purchase_timestamp_day purchase_timestamp_week
## Min. :2016 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2017 1st Qu.: 3.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2017 Median : 6.000 Median :16.00 Median :4.000
## Mean :2017 Mean : 5.822 Mean :15.64 Mean :3.997
## 3rd Qu.:2018 3rd Qu.: 8.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2018 Max. :12.000 Max. :31.00 Max. :7.000
## orders
## Min. : 1.00
## 1st Qu.: 97.75
## Median : 145.50
## Mean : 157.62
## 3rd Qu.: 213.00
## Max. :1147.00
Boxplot:
1. Most daily order counts cluster between 0 and 300.
2. Fridays, Saturdays, and Sundays show scattered outliers, with larger swings in order volume.
daily_orders %>%
ggplot(aes(x = as.factor(purchase_timestamp_week),
y = orders,group =purchase_timestamp_week)) +
geom_boxplot() +
stat_boxplot(geom = "errorbar", width = 0.3)+
ggtitle("week sales volume")+
xlab("week") +
ylab("orders") +
geom_point(position=position_jitter(width = .15, height = 0, seed = 320), size = 0.75) +
theme_classic(base_size = 19)
Violin (density) plot:
1. Monday and Tuesday order volumes are fairly evenly spread.
2. Wednesday, Thursday, Saturday, and Sunday show two modes.
3. Friday order volumes are comparatively concentrated.
daily_orders %>%
ggplot(aes(x = as.factor(purchase_timestamp_week),
y = orders,group =purchase_timestamp_week)) +
geom_violin(color = "transparent", fill = "gray90") +
ggtitle("week sales volume")+
xlab("week") +
ylab("orders") +
geom_point(position=position_jitter(width = .15, height = 0, seed = 320), size = 0.75) +
theme_classic(base_size = 19)
# improvement: add English product category names
# total spending per order
items2_products_cost <- items2_products1 %>%
group_by(order_id) %>%
summarise(
items_buy = n(), # total number of items in the order
n_distinct_shipping_limit_day = n_distinct(shipping_limit_date), # distinct seller shipping-limit dates in the order
n_distinct_product = n_distinct(product_id), # distinct products bought in one order
n_distinct_seller = n_distinct(seller_id), # distinct sellers in one order
product_total_cost = sum(price), # total product spending
product_category = ifelse(items_buy==1,product_category_name_english,"2+"),
Fright_total_cost = sum(freight_value), # total freight cost
Total_cost = sum(price)+sum(freight_value), # total spending
max_cost = max(price), # highest item price in the order
min_cost = min(price), # lowest item price in the order
avg_Fright = round(mean(freight_value),digits = 2), # average freight
avg_cost_nproduct = round(Total_cost/items_buy,digits = 2) # average cost per item (freight included)
)
###### Purchase categories ######
# compute revenue by product category and sort by revenue
top_10_categories <- items2_products %>%
group_by(product_category_name_english) %>%
summarise(
total_price = sum(price),
total_freight = sum(freight_value)
) %>%
top_n(10, total_price) %>%
arrange(desc(total_price)) %>%
pull(product_category_name_english)
# filter to the top-10 categories
items2_products_top10 <- items2_products %>%
filter(product_category_name_english %in% top_10_categories)
# violin plot of item prices for the top-10 categories
ggplot(items2_products_top10, aes(x = product_category_name_english, y = price)) +
geom_violin() +
xlab("Product Category") +
ylab("Price") +
ggtitle("Item Price Distribution for Top 10 Categories") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## fill NA values of product_category_name_english with "None"
items2_products$product_category_name_english <-
replace(items2_products$product_category_name_english,
is.na(items2_products$product_category_name_english), "None")
sum(is.na(items2_products$product_category_name_english))
## [1] 0
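An equivalent, more pipeline-friendly way to fill the missing categories — a sketch assuming tidyr (loaded with tidyverse) — is `replace_na()`:

```r
# Equivalent NA handling with tidyr::replace_na(), which slots
# directly into the existing dplyr pipeline.
items2_products <- items2_products %>%
  mutate(product_category_name_english =
           replace_na(product_category_name_english, "None"))
```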
1. bed_bath_table (9.9%)
2. health_beauty (8.6%)
3. sports_leisure (7.7%)
4. furniture_decor (7.4%)
5. computers_accessories (6.9%)
6. housewares (6.2%)
7. watches_gifts (5.3%)
# popular categories (n > 700)
items2_products %>% count(product_category_name_english,sort = T) %>%
mutate(ratio = round(n/sum(n), digits =3),
lable1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")
) %>%
filter(n>700) %>%
ggplot(aes(x = reorder(product_category_name_english,n),
y = ratio,
fill = product_category_name_english))+
geom_col(alpha = 0.7,
show.legend = FALSE)+
xlab("product_category_name")+
coord_flip()+ # flip the x and y axes
ggtitle("The ratio of Popular Products")+
geom_text(aes(label = lable1,
hjust =ifelse(n > 9000, 1.1, -0.1)),
size =3)+
scale_y_continuous(labels = scales::percent)
1. security_and_services (0%)
2. fashion_childrens_clothes (0%)
3. cds_dvds_musicals (0%)
4. la_cuisine (0%)
5. arts_and_craftmanship (0%)
6. fashion_sport (0%)
7. home_comfort_2 (0%)
# niche categories (n < 100)
items2_products %>% count(product_category_name_english,sort = T) %>%
mutate(ratio = round(n/sum(n), digits =3),
lable1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")
) %>%
filter(n<100) %>%
ggplot(aes(x = reorder(product_category_name_english,n),
y = ratio,
fill = product_category_name_english))+
geom_col(alpha = 0.7,
show.legend = FALSE)+
xlab("product_category_name")+
coord_flip()+ # flip the x and y axes
ggtitle("The ratio of Niche Products")+
geom_text(aes(label = lable1,
hjust =ifelse(n > 50, 1.1, -0.1)),
size =3)+
scale_y_continuous(labels = scales::percent)
There are 72 product categories in total.
unique(items2_products$product_category_name_english)
## [1] "cool_stuff" "pet_shop"
## [3] "furniture_decor" "perfumery"
## [5] "garden_tools" "housewares"
## [7] "telephony" "health_beauty"
## [9] "books_technical" "fashion_bags_accessories"
## [11] "bed_bath_table" "sports_leisure"
## [13] "consoles_games" "office_furniture"
## [15] "luggage_accessories" "food"
## [17] "agro_industry_and_commerce" "electronics"
## [19] "computers_accessories" "construction_tools_construction"
## [21] "audio" "baby"
## [23] "construction_tools_lights" "toys"
## [25] "stationery" "industry_commerce_and_business"
## [27] "watches_gifts" "auto"
## [29] "None" "home_appliances"
## [31] "kitchen_dining_laundry_garden_furniture" "air_conditioning"
## [33] "home_confort" "fixed_telephony"
## [35] "small_appliances_home_oven_and_coffee" "diapers_and_hygiene"
## [37] "signaling_and_security" "musical_instruments"
## [39] "small_appliances" "costruction_tools_garden"
## [41] "art" "home_construction"
## [43] "books_general_interest" "party_supplies"
## [45] "construction_tools_safety" "cine_photo"
## [47] "fashion_underwear_beach" "fashion_male_clothing"
## [49] "food_drink" "drinks"
## [51] "furniture_living_room" "market_place"
## [53] "music" "fashion_shoes"
## [55] "flowers" "home_appliances_2"
## [57] "fashio_female_clothing" "computers"
## [59] "books_imported" "christmas_supplies"
## [61] "furniture_bedroom" "home_comfort_2"
## [63] "dvds_blu_ray" "cds_dvds_musicals"
## [65] "arts_and_craftmanship" "furniture_mattress_and_upholstery"
## [67] "tablets_printing_image" "costruction_tools_tools"
## [69] "fashion_sport" "la_cuisine"
## [71] "security_and_services" "fashion_childrens_clothes"
Most orders are paid by credit card; the second most common payment method is boleto (a Brazilian bank payment slip).
payment_type_n <- order_payments %>% count(payment_type) %>%
mutate(ratio = round(n/sum(n),digits = 4))
payment_type_n %>%
mutate(lable1 = str_c(scales::percent(ratio)," ", "(",n,")")) %>%
ggplot(aes(x = reorder(payment_type,-ratio),
y = ratio))+
geom_col(fill="#006699",alpha = 0.7)+
xlab("payment type")+
ggtitle("The ratio of payment type")+
geom_text(aes(label = lable1,
vjust = -0.3 ),
size =3.5)+
scale_y_continuous(labels = scales::percent)
# number of installments chosen by the customer
library("RColorBrewer") #display.brewer.pal()
library(colorspace)
payment_installments_n <- order_payments %>% count(payment_installments) %>%
mutate(ratio = round(n/sum(n),digits = 3))
payment_installments_n %>%
mutate(lable1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")) %>%
ggplot(aes(x = reorder(payment_installments,ratio),
y = ratio,
fill=as.factor(payment_installments)))+
geom_col(alpha = 0.7)+
xlab("payment installments")+
coord_flip()+
ggtitle("Percentage of customers choosing installment terms")+
geom_text(aes(label = lable1,
hjust =ifelse(n > 50000, 1.3, -0.2)
),
color = ifelse(payment_installments_n$n > 50000, "white","black"),
size =3.5)+
scale_y_continuous(labels = scales::percent)+
scale_fill_discrete_sequential(palette = "Blues 3",
nmax = length(payment_installments_n$payment_installments),
rev = FALSE,
name = "payment installments")
##payment_value
summary(order_payments)
## order_id payment_sequential payment_type payment_installments payment_value
## Length:103886 Min. : 1.000 Length:103886 Min. : 0.000 Min. : 0.00
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.: 1.000 1st Qu.: 56.79
## Mode :character Median : 1.000 Mode :character Median : 1.000 Median : 100.00
## Mean : 1.093 Mean : 2.853 Mean : 154.10
## 3rd Qu.: 1.000 3rd Qu.: 4.000 3rd Qu.: 171.84
## Max. :29.000 Max. :24.000 Max. :13664.08
A weekday-by-hour heatmap of purchase timestamps shows that peak ordering occurs Monday through Friday between 10:00 and 23:00, with another wave on Sundays between 18:00 and 23:00.
##### Heatmaps #####
library(pheatmap)
# heatmap of weekday vs. hour of purchase
heatmap_data <- table(orders2$purchase_timestamp_week, orders2$purchase_timestamp_hour)
pheatmap(heatmap_data,
cluster_rows = F, # do not cluster rows
cluster_cols = F, # do not cluster columns
scale = "none", # scaling: 'none', 'row' or 'column'
color = rev(
colorspace::sequential_hcl(
n = n_distinct(orders2$purchase_timestamp_hour),
palette = "Purples 3") ),
treeheight_col = 10, # dendrogram height (unused: clustering is off)
treeheight_row =10,
legend =T,
main = "2016.9 ~2018.8 Heatmap between weeks and hours", # title
display_numbers=T, # show counts in cells
number_color = "black",
number_format = "%.0f", # no decimal places
fontsize =10, # title font size
fontsize_number = 8, # cell number font size
angle_col = 0 # column label angle
)
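If avoiding the extra pheatmap dependency is preferred, the same weekday-by-hour table can be drawn with ggplot2 alone; a minimal sketch using `geom_tile()`:

```r
# ggplot2 alternative: convert the contingency table to long
# format, then draw one tile per (weekday, hour) cell.
as.data.frame(heatmap_data) %>%
  ggplot(aes(x = Var2, y = Var1, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), size = 2.5) +
  scale_fill_gradient(low = "white", high = "purple4") +
  labs(x = "hour", y = "weekday",
       title = "Orders by weekday and hour")
```

`as.data.frame()` on a `table` yields the long columns `Var1`, `Var2`, `Freq`, which map directly onto the tile aesthetics.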
Splitting by year to compare weekly ordering patterns:
2016 has too little data to comment on; the 2017 and 2018 ordering patterns are very similar.
Conjecture: order volume may be influenced jointly by hour of day and weekday.
heatmap_plot <- function(heatmap_data, main_title){
pheatmap(heatmap_data,
cluster_rows = F, # do not cluster rows
cluster_cols = F, # do not cluster columns
scale = "none", # scaling: 'none', 'row' or 'column'
color = rev(
colorspace::sequential_hcl(
n = n_distinct(orders2$purchase_timestamp_hour),
palette = "Purples 3") ),
legend =T,
main = main_title, # title
display_numbers=T,
number_color ="black",
number_format = "%.0f",
fontsize =10,
fontsize_number = 8,
angle_col = 0 # column label angle
)
}
orders2_2016 <- orders2 %>% filter(purchase_timestamp_year=="2016")
heatmap_data <- table(orders2_2016$purchase_timestamp_week, orders2_2016$purchase_timestamp_hour)
heatmap_plot(heatmap_data,main = "2016 Heatmap between weeks and hours")
orders2_2017 <- orders2 %>% filter(purchase_timestamp_year=="2017")
heatmap_data <- table(orders2_2017$purchase_timestamp_week, orders2_2017$purchase_timestamp_hour)
heatmap_plot(heatmap_data,main = "2017 Heatmap between weeks and hours")
orders2_2018 <- orders2 %>% filter(purchase_timestamp_year=="2018")
heatmap_data <- table(orders2_2018$purchase_timestamp_week, orders2_2018$purchase_timestamp_hour)
heatmap_plot(heatmap_data,main = "2018 Heatmap between weeks and hours")
Test of independence (chi-squared test): are weekday and hour significantly associated with order volume?
\(H_0\): weekday/hour and order volume are not significantly associated \(\quad\) vs \(\quad\) \(H_1\): weekday/hour and order volume are significantly associated.
Inference: p-value < 0.05, so weekday and hour are significantly associated with order volume.
Consumers' purchasing behaviour across hours of the day differs significantly between weekdays.
heatmap_data <- table(orders2$purchase_timestamp_week, orders2$purchase_timestamp_hour)
chisq.test(heatmap_data)
##
## Pearson's Chi-squared test
##
## data: heatmap_data
## X-squared = 1138.5, df = 138, p-value < 2.2e-16
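Beyond the overall p-value, the standardized residuals returned by `chisq.test()` show which weekday-hour cells drive the association; a quick sketch:

```r
# Cells with |standardized residual| > 2 are over- or
# under-represented relative to independence.
chisq_result <- chisq.test(heatmap_data)
round(chisq_result$stdres, 1)
```

Large positive values mark the weekday/hour combinations with more orders than independence would predict.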
Geographic distribution of customers
42% of customers come from SP (São Paulo), 12.9% from RJ (Rio de Janeiro), and 11.7% from MG (Minas Gerais);
together these three states account for two thirds of all customers (66.6%).
customer_state_n <- customer %>% count(customer_state, sort = TRUE) %>%
mutate(ratio = round(n/sum(n), digits =3))
customer_state_n %>%
mutate(lable1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")) %>%
ggplot(aes(x = reorder(customer_state,n),
y = ratio))+
geom_col(alpha = 0.7)+
geom_col(data = customer_state_n %>% filter( (customer_state=="SP") | (customer_state=="RJ") | (customer_state=="MG")),fill="#E69F00")+
xlab("customer state")+
coord_flip()+
ggtitle("The ratio of customer state")+
geom_text(aes(label = lable1,
hjust =ifelse(n > 40000, 1.2, -0.3)),
size =3)+
scale_y_continuous(labels = scales::percent)
Geographic distribution of sellers
59.7% of sellers come from SP (São Paulo), 11.3% from PR (Paraná), and 7.9% from MG (Minas Gerais);
together these three states account for 78.9% of all sellers.
# sellers_state barplot: state distribution
sellers_state_n <- sellers %>% count(seller_state, sort = TRUE) %>%
mutate(ratio = round(n/sum(n), digits =3))
sellers_state_n %>%
mutate(lable1 = str_c(scales::percent(ratio, accuracy=0.1)," ", "(",n,")")) %>%
ggplot(aes(x = reorder(seller_state,n),
y = ratio))+
geom_col(alpha = 0.7)+
geom_col(data = sellers_state_n %>% filter( (seller_state=="SP") | (seller_state=="PR") | (seller_state=="MG")),fill="#E69F00")+
xlab("sellers state")+
coord_flip()+
ggtitle("The ratio of sellers state")+
geom_text(aes(label = lable1,
hjust =ifelse(n > 500, 1.2, -0.3)),
size =3)+
scale_y_continuous(labels = scales::percent)
# draw the map of Brazil (done)
library(maps) # world map data
library(rnaturalearth)
library(rnaturalearthhires) # ne_states
library(sf) # simple features
brazil <- ne_countries(country = "Brazil", returnclass = "sf")
brazil_states <- ne_states(country = "Brazil", returnclass = "sf")
# convert to sf objects
customer_sf <- st_as_sf(customer_zip1,
coords = c("geolocation_lng", "geolocation_lat"),
crs = 4326)
sellers_sf <- st_as_sf(sellers_zip1 ,
coords = c("geolocation_lng", "geolocation_lat"),
crs = 4326)
Customer coordinates
## points only, no base map
ggplot(customer_zip1,
aes(geolocation_lng, geolocation_lat)) +
geom_point(size = .25, show.legend = FALSE) +
coord_quickmap()
Customer and seller coordinates
#customer population
g1 <- ggplot() +
geom_sf(data = brazil_states)+
geom_sf(data = customer_sf, color = "red", size = 1) + # plot customer points on the map
geom_sf_text(data = brazil_states, aes(label = name), size = 3) + # label state names
labs(title = "customer population in Brazil") +
xlab("Longitude") + ylab("Latitude") +
theme(legend.position = "none")
#sellers population
g2 <- ggplot() +
geom_sf(data = brazil_states)+
geom_sf(data = sellers_sf, color = "red", size = 1) +
geom_sf_text(data = brazil_states, aes(label = name), size = 3) +
labs(title = "sellers population in Brazil") +
xlab("Longitude") + ylab("Latitude") +
theme(legend.position = "none")
g1+g2
Choropleth of customer and seller distribution by state
The maps show that both customers and sellers concentrate around SP (São Paulo) and spread to neighbouring states.
No sellers come from the four states RR (Roraima), AP (Amapá), TO (Tocantins), or AL (Alagoas).
cn <- customer_zip1 %>% group_by(customer_state) %>%
summarise(
count = n()
)
bn <- sellers_zip1 %>% group_by(seller_state) %>%
summarise(
count = n()
)
#customer population
# approach 2: join state-level counts onto the state polygons
brazil_states$iso_3166_2 <- str_remove(brazil_states$iso_3166_2, "BR-")
names(cn)[1] <- "iso_3166_2"
names(bn)[1] <- "iso_3166_2"
brazil_states1 <-
left_join(brazil_states, cn, by = "iso_3166_2")
brazil_states2 <-
left_join(brazil_states, bn, by = "iso_3166_2")
ggplot(customer_sf) +
geom_sf(data = brazil_states1,aes(fill = count))+
coord_sf()+
geom_sf_text(data = brazil_states1, aes(label = iso_3166_2), size = 3)+
scale_fill_continuous_sequential(palette = "Reds3", rev = T)+
labs(title = "Customer population in Brazil") +
xlab("Longitude") + ylab("Latitude")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not give correct results
## for longitude/latitude data
ggplot(sellers_sf) +
geom_sf(data = brazil_states2,aes(fill = count))+
coord_sf()+
geom_sf_text(data = brazil_states2, aes(label = iso_3166_2), size = 3)+
scale_fill_continuous_sequential(palette = "Pink Yl", rev = T)+
#limits = c(0, 40000))+
labs(title = "Sellers population in Brazil") +
xlab("Longitude") + ylab("Latitude")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not give correct results
## for longitude/latitude data
## density of daily order volume
##### daily_orders #####
daily_orders <- orders2 %>%
group_by(purchase_timestamp_year,purchase_timestamp_month,
purchase_timestamp_day,purchase_timestamp_week) %>%
summarise(orders = n())
month_orders <- orders2 %>%
group_by(purchase_timestamp_year,purchase_timestamp_month) %>%
summarise(orders = n())
hour_orders <- orders2 %>%
group_by(purchase_timestamp_year,purchase_timestamp_month,
purchase_timestamp_day,purchase_timestamp_week,
purchase_timestamp_hour) %>%
summarise(orders = n())
Daily order counts look roughly Poisson-distributed, except that the mean (157.616) ≠ the variance (8014.05).
#####daily_orders histogram#####
daily_orders %>% ggplot(aes(x = orders)) +
geom_histogram(aes(y = after_stat(density)))+
geom_density(color = "blue", linewidth = 1)+
geom_vline(aes(xintercept = mean(orders)), color = "red", linetype = "dashed", linewidth = 1) +
ggtitle("daily_orders density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cat("mean:",mean(daily_orders$orders),"\n")
## mean: 157.616
cat("variance:",var(daily_orders$orders),"\n")
## variance: 8014.05
cat("max:",max(daily_orders$orders))
## max: 1147
Even after excluding the 11/24 outlier, the mean (155.9967) still does not equal the variance (6419.839).
A Poisson model is therefore unsuitable; we build the predictive model
with negative binomial regression instead.
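The mean-versus-variance comparison can be backed by a formal overdispersion test; a sketch assuming the AER package is installed:

```r
# Fit an intercept-only Poisson model, then test
# H0: dispersion = 1 against overdispersion (AER package).
library(AER)
pois_model <- glm(orders ~ 1, family = poisson, data = daily_orders)
dispersiontest(pois_model)  # a small p-value supports overdispersion
```

A significant result confirms that the Poisson equidispersion assumption fails, motivating the negative binomial model below.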
daily_orders1 <- daily_orders[-which(daily_orders$orders==1147),]
daily_orders1 %>% ggplot(aes(x = orders)) +
geom_histogram(aes(y = after_stat(density)))+
geom_density(color = "blue", linewidth = 1)+
geom_vline(aes(xintercept = mean(orders)), color = "red", linetype = "dashed", linewidth = 1) +
ggtitle("daily_orders density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cat("mean:",mean(daily_orders1$orders),"\n")
## mean: 155.9967
cat("variance:",var(daily_orders1$orders),"\n")
## variance: 6419.839
cat("max:",max(daily_orders1$orders))
## max: 487
\(y\) ~ \(week\)
##### initial negbin_model #####
library(MASS)
daily_orders$purchase_timestamp_week <- as.factor(daily_orders$purchase_timestamp_week)
negbin_model <- glm.nb(orders~ purchase_timestamp_week , data = daily_orders)
summary(negbin_model)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week, data = daily_orders,
## init.theta = 3.010848428, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.184270 0.061951 83.684 < 2e-16 ***
## purchase_timestamp_week2 -0.001454 0.087864 -0.017 0.98680
## purchase_timestamp_week3 -0.029452 0.087874 -0.335 0.73751
## purchase_timestamp_week4 -0.092195 0.087647 -1.052 0.29285
## purchase_timestamp_week5 -0.137844 0.087665 -1.572 0.11586
## purchase_timestamp_week6 -0.386487 0.088036 -4.390 1.13e-05 ***
## purchase_timestamp_week7 -0.288577 0.087986 -3.280 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.0108) family taken to be 1)
##
## Null deviance: 687.61 on 611 degrees of freedom
## Residual deviance: 653.91 on 605 degrees of freedom
## AIC: 7130.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.011
## Std. Err.: 0.170
##
## 2 x log-likelihood: -7114.126
The baseline is Monday; Saturday and Sunday stand out as significant predictors of order volume, and the table shows significantly fewer orders on both days.
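Because the negative binomial model uses a log link, exponentiating the coefficients turns them into rate ratios relative to the Monday baseline:

```r
# exp(coef) gives multiplicative effects on expected orders;
# e.g. exp(-0.3865) ≈ 0.68 for week6, i.e. roughly 32% fewer
# orders on Saturdays than on Mondays.
round(exp(coef(negbin_model)), 3)
```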
par(mfrow = c(2, 2))
plot(negbin_model)
Residual plots:
Residuals vs Fitted: residuals scatter randomly around 0, apart from a few high-residual points (334, 335, 336).
Q-Q plot: points follow the 45-degree line overall, with some deviation in the right tail, so the residuals are not perfectly normal and the right tail contains some outliers.
Scale-Location: the square root of the standardized residuals vs fitted values looks random, consistent with homoscedasticity.
Residuals vs Leverage: a few isolated high-leverage points and outliers (334, 335, 336).
Conclusions:
Model fit: generally good, but a few high residuals warrant further checking.
Residual distribution: right-tail deviation indicates outliers or extreme values.
Scale-Location: residual variance appears homogeneous; no heteroscedasticity problem.
Outliers and leverage: a few high-leverage points require further checking.
Overall a reasonable model, but the high residuals and high-leverage points need follow-up.
Checking the observations:
Observations 334, 335, and 336 correspond to 11/24, 11/25, and 11/26 — the promotional order peak — so we consider removing them and refitting.
daily_orders[c(334,335,336),]
## # A tibble: 3 × 5
## # Groups: purchase_timestamp_year, purchase_timestamp_month, purchase_timestamp_day [3]
## purchase_timestamp_y…¹ purchase_timestamp_m…² purchase_timestamp_day purchase_timestamp_w…³ orders
## <dbl> <dbl> <int> <fct> <int>
## 1 2017 11 24 5 1147
## 2 2017 11 25 6 487
## 3 2017 11 26 7 382
## # ℹ abbreviated names: ¹purchase_timestamp_year, ²purchase_timestamp_month,
## # ³purchase_timestamp_week
daily_orders1 <- daily_orders[-c(334,335,336),]
daily_orders1$purchase_timestamp_week <- as.factor(daily_orders1$purchase_timestamp_week)
negbin_model1 <- glm.nb(orders~ purchase_timestamp_week , data = daily_orders1)
summary(negbin_model1)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week, data = daily_orders1,
## init.theta = 3.211134206, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.184270 0.060021 86.374 < 2e-16 ***
## purchase_timestamp_week2 -0.001454 0.085126 -0.017 0.986372
## purchase_timestamp_week3 -0.029452 0.085137 -0.346 0.729394
## purchase_timestamp_week4 -0.092195 0.084919 -1.086 0.277619
## purchase_timestamp_week5 -0.213978 0.085216 -2.511 0.012039 *
## purchase_timestamp_week6 -0.422196 0.085575 -4.934 8.07e-07 ***
## purchase_timestamp_week7 -0.310405 0.085513 -3.630 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.2111) family taken to be 1)
##
## Null deviance: 694.10 on 608 degrees of freedom
## Residual deviance: 650.28 on 602 degrees of freedom
## AIC: 7045.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.211
## Std. Err.: 0.183
##
## 2 x log-likelihood: -7029.777
2 x log-likelihood increased: the model fits the data better.
AIC decreased: a better fit without added complexity.
par(mfrow = c(2, 2))
plot(negbin_model1)
Conclusions after removing outliers 334, 335, and 336:
Residual analysis improved: removing the outliers resolves the high-residual and high-leverage problems, and the new model is more stable.
Lower AIC: the new model achieves a better fit.
Higher 2 x log-likelihood: the new model fits better and has stronger explanatory power.
Randomly split the data: 70% for training, 30% for testing.
###### Validation #####
library(caret)
set.seed(123)
train_index <- createDataPartition(daily_orders1$orders, p = 0.7, list = FALSE)
train_data <- daily_orders1[train_index, ]
test_data <- daily_orders1[-train_index, ]
# train_data
model <- glm.nb(formula = orders ~ purchase_timestamp_week, data = train_data, link = log)
summary(model)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week, data = train_data,
## link = log, init.theta = 3.216424384)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.22455 0.07085 73.736 < 2e-16 ***
## purchase_timestamp_week2 -0.04940 0.09837 -0.502 0.615527
## purchase_timestamp_week3 -0.12753 0.10151 -1.256 0.208977
## purchase_timestamp_week4 -0.09532 0.10487 -0.909 0.363383
## purchase_timestamp_week5 -0.28853 0.09957 -2.898 0.003758 **
## purchase_timestamp_week6 -0.46124 0.10171 -4.535 5.76e-06 ***
## purchase_timestamp_week7 -0.34961 0.10163 -3.440 0.000582 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.2164) family taken to be 1)
##
## Null deviance: 491.46 on 428 degrees of freedom
## Residual deviance: 457.96 on 422 degrees of freedom
## AIC: 4965.3
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.216
## Std. Err.: 0.219
##
## 2 x log-likelihood: -4949.344
predictions <- predict(model, test_data, type = "response")
# evaluate model accuracy
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)
print(paste("MSE: ", mse))
## [1] "MSE: 5926.13496475934"
print(paste("RMSE: ", rmse))
## [1] "RMSE: 76.9813936270274"
print(paste("AIC: ", aic))
## [1] "AIC: 4965.34443471231"
##
train_data %>% ggplot(aes(x = orders)) +
geom_histogram(aes(y = after_stat(density)))+
geom_density(color = "blue", linewidth = 1)+
geom_vline(aes(xintercept = mean(orders)), color = "red", linetype = "dashed", linewidth = 1) +
ggtitle("daily_orders density")
print(paste("mean of train_data:",mean(train_data$orders)))
## [1] "mean of train_data: 154.839160839161"
## Actual vs. Predicted Orders
library(ggplot2)
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs. Predicted Orders",
x = "Actual Orders",
y = "Predicted Orders") +
theme_minimal()
Cross-validation with k = 10
# train the model with 10-fold cross-validation
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week,
data = daily_orders1,
method = "glm.nb",
trControl = train_control)
print(model)
## Negative Binomial Generalized Linear Model
##
## 609 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 547, 548, 548, 549, 548, 548, ...
## Resampling results across tuning parameters:
##
## link RMSE Rsquared MAE
## identity 75.66016 0.08889727 63.28178
## log 75.66016 0.08889727 63.28178
## sqrt 75.66016 0.08889727 63.28178
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = log.
The fit is poor: predicted values differ substantially from actual values.
Order counts range from 0 to 487, yet the mean absolute error is about 63 and the RMSE about 75, so predictions deviate widely from actual values on average.
The model explains only about 8.89% of the variance, meaning it fails to capture most of the pattern in the data.
Metric definitions:
Mean Absolute Error (MAE)
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 63.28178 \]
Root Mean Squared Error (RMSE)
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 75.66016 \] Coefficient of Determination (R²)
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.08889727 = 8.88\% \]
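The three formulas map directly onto R; a sketch reusing `test_data` and the `predictions` vector produced by the 70/30 split above:

```r
# MAE, RMSE and R-squared computed by hand; y and y_hat are
# the actual and predicted order counts on the test set.
y     <- test_data$orders
y_hat <- predictions   # from predict(model, test_data, type = "response")
mae  <- mean(abs(y - y_hat))
rmse <- sqrt(mean((y - y_hat)^2))
r2   <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
c(MAE = mae, RMSE = rmse, R2 = r2)
```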
Adding the day-over-day change in order volume
###### create lag-1 and daily-difference variables #####
daily_orders1$lag1_orders <- dplyr::lag(daily_orders1$orders, 1)
daily_orders1$diff_orders <- daily_orders1$orders - dplyr::lag(daily_orders1$orders, 1)
# drop rows with NA
daily_orders2 <- na.omit(daily_orders1)
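`dplyr::lag()` shifts a vector down by one position (padding with NA), so the first difference is simply `x - lag(x)`; a toy illustration:

```r
# lag() shifts by one; subtracting gives day-over-day changes.
x <- c(100, 150, 130)
dplyr::lag(x, 1)      # NA 100 150
x - dplyr::lag(x, 1)  # NA  50 -20
```

The leading NA is why `na.omit()` is applied before modelling.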
## plot the daily change
daily_orders2$N <- c(1:nrow(daily_orders2))
daily_orders2 %>%
ggplot(aes(x=N,diff_orders)) +
geom_line()+
ggtitle("Differential variable for daily orders")
\(y\) ~ \(week + diff\_orders\)
##negbin_model2
negbin_model2 <- glm.nb(orders~ purchase_timestamp_week+ diff_orders, data = daily_orders2)
summary(negbin_model2)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders,
## data = daily_orders2, init.theta = 3.355518393, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.0355398 0.0701814 71.750 < 2e-16 ***
## purchase_timestamp_week2 0.1569526 0.0923791 1.699 0.08932 .
## purchase_timestamp_week3 0.1274052 0.0934646 1.363 0.17284
## purchase_timestamp_week4 0.0943661 0.0948430 0.995 0.31975
## purchase_timestamp_week5 -0.0155281 0.0986774 -0.157 0.87496
## purchase_timestamp_week6 -0.1906683 0.1039308 -1.835 0.06657 .
## purchase_timestamp_week7 -0.2051582 0.0880204 -2.331 0.01976 *
## diff_orders 0.0024336 0.0008064 3.018 0.00254 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.3555) family taken to be 1)
##
## Null deviance: 702.59 on 607 degrees of freedom
## Residual deviance: 646.60 on 600 degrees of freedom
## AIC: 7014.6
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.356
## Std. Err.: 0.192
##
## 2 x log-likelihood: -6996.640
The table shows that the daily change in order volume is a significant predictor of sales.
par(mfrow = c(2, 2))
plot(negbin_model2)
Residual plots:
Residuals vs Fitted: residuals scatter around 0 but split into two clusters, suggesting some remaining structure.
Q-Q plot: right-tail deviation; the residuals are not perfectly normal and the right tail contains some outliers.
Scale-Location: also shows two clusters, suggesting some remaining structure.
Residuals vs Leverage: no isolated high-leverage points or outliers.
Conclusions:
Model fit: some apparent structure remains; needs further checking.
Residual distribution: right-tail deviation indicates outliers or extreme values.
Scale-Location: residual variance is not constant; there is a heteroscedasticity problem.
Outliers and leverage: none present.
Overall the model's explanatory power increased, but some structure remains and needs further adjustment.
### Validation
set.seed(123)
train_index <- createDataPartition(daily_orders2$orders, p = 0.7, list = FALSE)
train_data <- daily_orders2[train_index, ]
test_data <- daily_orders2[-train_index, ]
# train_data
model <- glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders, data = train_data, link = log)
summary(model)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders,
## data = train_data, link = log, init.theta = 3.357073728)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.0389982 0.0810701 62.156 < 2e-16 ***
## purchase_timestamp_week2 0.1837154 0.1069400 1.718 0.08581 .
## purchase_timestamp_week3 0.0695819 0.1104930 0.630 0.52886
## purchase_timestamp_week4 0.0596925 0.1116288 0.535 0.59283
## purchase_timestamp_week5 -0.0125628 0.1148012 -0.109 0.91286
## purchase_timestamp_week6 -0.1698997 0.1210926 -1.403 0.16060
## purchase_timestamp_week7 -0.2364969 0.1030128 -2.296 0.02169 *
## diff_orders 0.0028126 0.0009253 3.040 0.00237 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.3571) family taken to be 1)
##
## Null deviance: 499.43 on 427 degrees of freedom
## Residual deviance: 455.99 on 420 degrees of freedom
## AIC: 4941.2
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.357
## Std. Err.: 0.229
##
## 2 x log-likelihood: -4923.218
predictions <- predict(model, test_data, type = "response")
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)
print(paste("MSE: ", mse))
## [1] "MSE: 5748.59644159849"
print(paste("RMSE: ", rmse))
## [1] "RMSE: 75.8194990856474"
print(paste("AIC: ", aic))
## [1] "AIC: 4941.21753255967"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs. Predicted Orders",
x = "Actual Orders",
y = "Predicted Orders") +
theme_minimal()
Cross-validation with k = 10
# train the model with 10-fold cross-validation
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week + diff_orders,
data = daily_orders2,
method = "glm.nb",
trControl = train_control)
print(model)
## Negative Binomial Generalized Linear Model
##
## 608 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 546, 547, 548, 548, 547, 547, ...
## Resampling results across tuning parameters:
##
## link RMSE Rsquared MAE
## identity 75.00710 0.1033302 62.86533
## log 74.50003 0.1206841 62.89483
## sqrt 74.70994 0.1144106 63.05278
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = log.
The fit still falls short, but the gap between actual and predicted values has narrowed, so adding this variable appears to be a step in the right direction.
After cross-validation, however, the average error is essentially unchanged and the explanatory power shows no significant improvement.
Metric definitions:
Mean Absolute Error (MAE)
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 62.86533 \]
Root Mean Squared Error (RMSE)
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 75.00710 \] Coefficient of Determination (R²)
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.1033302 = 10.33\% \]
Adding morning / afternoon / night / early-morning period variables
## add time-of-day periods
hour_orders <- hour_orders %>%
mutate(period = case_when(
purchase_timestamp_hour >= 0 & purchase_timestamp_hour <= 6 ~ "early_morning",
purchase_timestamp_hour > 6 & purchase_timestamp_hour <= 12 ~ "morning",
purchase_timestamp_hour > 12 & purchase_timestamp_hour <= 18 ~ "afternoon",
purchase_timestamp_hour > 18 & purchase_timestamp_hour <=24 ~ "night"
))
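The same bucketing can be written with `cut()`, which makes the interval boundaries explicit; a sketch equivalent to the `case_when()` above:

```r
# Right-closed breaks (-1,6], (6,12], (12,18], (18,23]
# reproduce the four case_when() branches.
hour_orders <- hour_orders %>%
  mutate(period = cut(purchase_timestamp_hour,
                      breaks = c(-1, 6, 12, 18, 23),
                      labels = c("early_morning", "morning",
                                 "afternoon", "night")))
```

Note that `cut()` returns a factor rather than a character vector, which is equally usable in the grouping below.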
period_orders <- hour_orders %>%
group_by(purchase_timestamp_year,purchase_timestamp_month,
purchase_timestamp_day,purchase_timestamp_week,period) %>%
summarise(period_orders = sum(orders))
# the values_fill argument fills missing period counts with 0
period_orders <- period_orders %>%
pivot_wider(names_from = period,
values_from = period_orders,
values_fill = list(period_orders = 0))%>%
mutate(orders = morning + afternoon + night + early_morning)
##### morning / afternoon / night / early-morning period features #####
# compute lag-1 and daily-difference variables per period
period_orders$lag1_orders <- dplyr::lag(period_orders$orders, 1)
period_orders$diff_orders <- abs(period_orders$orders - dplyr::lag(period_orders$orders, 1))
period_orders$lag1_morning <- dplyr::lag(period_orders$morning, 1)
period_orders$diff_morning <- abs(period_orders$morning - dplyr::lag(period_orders$morning, 1))
period_orders$lag1_afternoon <- dplyr::lag(period_orders$afternoon, 1)
period_orders$diff_afternoon <- abs(period_orders$afternoon - dplyr::lag(period_orders$afternoon, 1))
period_orders$lag1_night <- dplyr::lag(period_orders$night, 1)
period_orders$diff_night <- abs(period_orders$night - dplyr::lag(period_orders$night, 1))
period_orders$lag1_early_morning <- dplyr::lag(period_orders$early_morning, 1)
period_orders$diff_early_morning <- abs(period_orders$early_morning - dplyr::lag(period_orders$early_morning, 1))
# drop rows with NA
period_orders1 <- na.omit(period_orders)
\(y\) ~ \(week + diff\_orders + diff\_morning + diff\_afternoon + diff\_night + diff\_early\_morning\)
period_orders1 <- period_orders1[-c(333,334,335),]
period_orders1$purchase_timestamp_week <- as.factor(period_orders1$purchase_timestamp_week)
period_orders1$purchase_timestamp_month <- as.factor(period_orders1$purchase_timestamp_month)
negbin_model3 <- glm.nb(orders~ purchase_timestamp_week+diff_orders+
diff_morning+diff_afternoon+
diff_night+diff_early_morning,
data = period_orders1)
summary(negbin_model3)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders +
## diff_morning + diff_afternoon + diff_night + diff_early_morning,
## data = period_orders1, init.theta = 4.447943597, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.360563 0.071757 60.768 < 2e-16 ***
## purchase_timestamp_week2 0.305134 0.077890 3.918 8.95e-05 ***
## purchase_timestamp_week3 0.289908 0.077665 3.733 0.000189 ***
## purchase_timestamp_week4 0.225370 0.077785 2.897 0.003764 **
## purchase_timestamp_week5 0.079473 0.077542 1.025 0.305413
## purchase_timestamp_week6 -0.207187 0.075136 -2.757 0.005825 **
## purchase_timestamp_week7 0.004698 0.080386 0.058 0.953398
## diff_orders -0.002219 0.001484 -1.496 0.134771
## diff_morning 0.014275 0.002457 5.809 6.30e-09 ***
## diff_afternoon 0.013420 0.002428 5.527 3.26e-08 ***
## diff_night 0.015152 0.002220 6.825 8.79e-12 ***
## diff_early_morning 0.026243 0.006287 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.4479) family taken to be 1)
##
## Null deviance: 918.40 on 607 degrees of freedom
## Residual deviance: 645.85 on 596 degrees of freedom
## AIC: 6849.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.448
## Std. Err.: 0.262
##
## 2 x log-likelihood: -6823.931
The table shows that the change in orders within each daily period is a significant predictor of sales volume; once these period-level changes are included, the overall daily change (diff_orders) is no longer significant.
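Because glm.nb uses a log link, each coefficient exponentiates to a multiplicative rate ratio on expected orders. A quick check using the estimates printed in the summary above:

```r
# Rate ratios from the log-link coefficients of negbin_model3:
exp(0.305134)  # purchase_timestamp_week2: ~1.357, i.e. ~35.7% more orders than the baseline weekday
exp(0.014275)  # diff_morning: ~1.014, i.e. ~1.4% more expected orders per one-unit change
# In practice, exp(coef(negbin_model3)) returns all of them at once.
```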
par(mfrow = c(2, 2))
plot(negbin_model3)
Residual diagnostics:
Residuals vs Fitted: residuals scatter around 0 with no obvious trend.
Q-Q plot: the right tail deviates somewhat, so the residuals are not perfectly normal and a few extreme values sit in the right tail.
Scale-Location: points cluster around a flat line with no obvious trend.
Residuals vs Leverage: no isolated high-leverage points or outliers.
Conclusions:
Model fit: appears good.
Residual distribution: the right tail deviates, indicating some outliers or extreme values.
Scale-Location: the residual variance is roughly constant, so there is no heteroscedasticity problem.
Outliers and high-leverage points: none detected.
Overall, the added variables improve the model's explanatory power.
set.seed(123)
train_index <- createDataPartition(period_orders1$orders, p = 0.7, list = FALSE)
train_data <- period_orders1[train_index, ]
test_data <- period_orders1[-train_index, ]
# Fit on the training data
model <- glm.nb(formula = orders ~ purchase_timestamp_week+diff_orders+
diff_morning+diff_afternoon+
diff_night+diff_early_morning,
data = train_data, link = log)
summary(model)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders +
## diff_morning + diff_afternoon + diff_night + diff_early_morning,
## data = train_data, link = log, init.theta = 4.504909298)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.400898 0.082438 53.385 < 2e-16 ***
## purchase_timestamp_week2 0.307413 0.090264 3.406 0.00066 ***
## purchase_timestamp_week3 0.190859 0.090466 2.110 0.03488 *
## purchase_timestamp_week4 0.161698 0.091853 1.760 0.07834 .
## purchase_timestamp_week5 0.064458 0.090547 0.712 0.47655
## purchase_timestamp_week6 -0.222187 0.089679 -2.478 0.01323 *
## purchase_timestamp_week7 -0.037956 0.094175 -0.403 0.68692
## diff_orders -0.002486 0.001689 -1.472 0.14106
## diff_morning 0.014189 0.002856 4.968 6.75e-07 ***
## diff_afternoon 0.014314 0.002943 4.864 1.15e-06 ***
## diff_night 0.015563 0.002477 6.282 3.34e-10 ***
## diff_early_morning 0.018249 0.007718 2.364 0.01807 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.5049) family taken to be 1)
##
## Null deviance: 660.07 on 427 degrees of freedom
## Residual deviance: 456.00 on 416 degrees of freedom
## AIC: 4823
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.505
## Std. Err.: 0.317
##
## 2 x log-likelihood: -4797.041
predictions <- predict(model, test_data, type = "response")
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)
print(paste("MSE: ", mse))
## [1] "MSE: 4611.49800856018"
print(paste("RMSE: ", rmse))
## [1] "RMSE: 67.9080113724455"
print(paste("AIC: ", aic))
## [1] "AIC: 4823.04080163789"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs. Predicted Orders",
x = "Actual Orders",
y = "Predicted Orders") +
theme_minimal()
Cross-validation with k = 10:
# Train the model with k-fold cross-validation
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week+diff_orders+
diff_morning+diff_afternoon+
diff_night+diff_early_morning,
data = period_orders1,
method = "glm.nb",
trControl = train_control)
print(model)
## Negative Binomial Generalized Linear Model
##
## 608 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 546, 547, 548, 548, 547, 547, ...
## Resampling results across tuning parameters:
##
## link RMSE Rsquared MAE
## identity 71.42328 0.2587004 55.34819
## log 67.78825 0.3238347 51.87033
## sqrt 65.07121 0.3564331 50.51995
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = sqrt.
The fit improves somewhat; the predictions appear to differ from the actuals mainly by a slope.
After cross-validation the average error drops, meaning the gap between predicted and actual orders narrows; R² reaches 25.9% under the identity link and 35.6% under the selected sqrt link.
Metric summary (identity-link results):
Mean Absolute Error (MAE)
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 55.34819 \]
Root Mean Squared Error (RMSE)
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 71.42328 \]
Coefficient of Determination (R²)
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.2587004 = 25.87\% \]
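These three metrics are straightforward to compute directly in base R; a minimal sketch on toy vectors (the numbers below are made up for illustration):

```r
y    <- c(100, 120, 90, 110)  # actual orders (hypothetical)
yhat <- c( 95, 125, 85, 120)  # predicted orders (hypothetical)

mae  <- mean(abs(y - yhat))                           # mean absolute error: 6.25
rmse <- sqrt(mean((y - yhat)^2))                      # root mean squared error: ~6.614
r2   <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # coefficient of determination: 0.65
c(MAE = mae, RMSE = rmse, R2 = r2)
```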
Modeling with added regional information
# Group by purchase date, weekday, hour, and customer state; count hourly orders
hour_orders_customer <- orders_customer3%>%
group_by(purchase_timestamp_year,purchase_timestamp_month,
purchase_timestamp_day,purchase_timestamp_week,
purchase_timestamp_hour,customer_state) %>%
summarise(orders = n())
hour_orders_customer <- hour_orders_customer %>%
mutate(period = case_when(
purchase_timestamp_hour >= 0 & purchase_timestamp_hour <= 6 ~ "early_morning",
purchase_timestamp_hour > 6 & purchase_timestamp_hour <= 12 ~ "morning",
purchase_timestamp_hour > 12 & purchase_timestamp_hour <= 18 ~ "afternoon",
purchase_timestamp_hour > 18 & purchase_timestamp_hour <=24 ~ "night"
))
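The same binning can be done with base R's `cut()`, which makes the interval boundaries explicit (a sketch; the breaks mirror the `case_when()` above, with `include.lowest` pulling hour 0 into the first bin):

```r
hour <- c(0, 6, 7, 12, 13, 18, 19, 23)
period <- cut(hour,
              breaks = c(0, 6, 12, 18, 24),  # intervals (0,6] (6,12] (12,18] (18,24]
              labels = c("early_morning", "morning", "afternoon", "night"),
              include.lowest = TRUE)         # makes the first interval [0,6]
as.character(period)
# "early_morning" "early_morning" "morning" "morning" "afternoon" "afternoon" "night" "night"
```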
hour_orders_customer <- hour_orders_customer %>%
mutate(state = case_when(
customer_state =="SP" ~ "SP",
customer_state =="RJ" ~ "RJ",
customer_state =="MG" ~ "MG",
TRUE ~ "others"
))
## Aggregate orders by period
period_state_suborders1 <- hour_orders_customer %>%
group_by(purchase_timestamp_year,purchase_timestamp_month,
purchase_timestamp_day,purchase_timestamp_week,period) %>%
summarise(period_orders = sum(orders))
period_state_suborders1 <- period_state_suborders1 %>%
pivot_wider(names_from = period,
values_from = period_orders,
values_fill = list(period_orders = 0))%>%
mutate(orders = morning + afternoon + night + early_morning)  # row-wise daily total
period_state_suborders1$key <- c(1:nrow(period_state_suborders1))
## Aggregate orders by state
period_state_suborders2 <- hour_orders_customer %>%
group_by(purchase_timestamp_year,purchase_timestamp_month,
purchase_timestamp_day,purchase_timestamp_week,state) %>%
summarise(state_orders = sum(orders))
period_state_suborders2 <- period_state_suborders2 %>%
pivot_wider(names_from = state,
values_from = state_orders,
values_fill = list(state_orders = 0))%>%
mutate(orders = SP + RJ + MG + others)  # row-wise daily total
period_state_suborders2$key <- c(1:nrow(period_state_suborders2))
sum(period_state_suborders2$orders-daily_orders$orders)
## [1] 0
# Join period_state_suborders1 and period_state_suborders2 on the row key
period_state_orders <- left_join(period_state_suborders1[, -c(9)], period_state_suborders2[, 5:10], by = "key")
period_state_orders <- period_state_orders[-c(334,335,336),]
#####Add regional data: period_state_orders#####
period_state_orders$lag1_orders <- dplyr::lag(period_state_orders$orders, 1)
period_state_orders$lag2_orders <- dplyr::lag(period_state_orders$orders, 2)
period_state_orders$diff_orders <- abs(period_state_orders$lag2_orders - period_state_orders$lag1_orders)
period_state_orders$lag1_morning <- dplyr::lag(period_state_orders$morning, 1)
period_state_orders$lag2_morning <- dplyr::lag(period_state_orders$morning, 2)
period_state_orders$diff_morning <- abs(period_state_orders$lag2_morning - period_state_orders$lag1_morning)
period_state_orders$lag1_afternoon <- dplyr::lag(period_state_orders$afternoon, 1)
period_state_orders$lag2_afternoon <- dplyr::lag(period_state_orders$afternoon, 2)
period_state_orders$diff_afternoon <- abs(period_state_orders$lag2_afternoon - period_state_orders$lag1_afternoon)
period_state_orders$lag1_night <- dplyr::lag(period_state_orders$night, 1)
period_state_orders$lag2_night <- dplyr::lag(period_state_orders$night, 2)
period_state_orders$diff_night <- abs(period_state_orders$lag2_night - period_state_orders$lag1_night)
period_state_orders$lag1_early_morning <- dplyr::lag(period_state_orders$early_morning, 1)
period_state_orders$lag2_early_morning <- dplyr::lag(period_state_orders$early_morning, 2)
period_state_orders$diff_early_morning <- abs(period_state_orders$lag2_early_morning - period_state_orders$lag1_early_morning)
period_state_orders$lag1_SP <- dplyr::lag(period_state_orders$SP, 1)
period_state_orders$lag2_SP <- dplyr::lag(period_state_orders$SP, 2)
period_state_orders$diff_SP <- abs(period_state_orders$lag2_SP - period_state_orders$lag1_SP)
period_state_orders$lag1_MG <- dplyr::lag(period_state_orders$MG, 1)
period_state_orders$lag2_MG <- dplyr::lag(period_state_orders$MG, 2)
period_state_orders$diff_MG <- abs(period_state_orders$lag2_MG - period_state_orders$lag1_MG)
period_state_orders$lag1_RJ <- dplyr::lag(period_state_orders$RJ, 1)
period_state_orders$lag2_RJ <- dplyr::lag(period_state_orders$RJ, 2)
period_state_orders$diff_RJ <- abs(period_state_orders$lag2_RJ - period_state_orders$lag1_RJ)
period_state_orders$lag1_others <- dplyr::lag(period_state_orders$others, 1)
period_state_orders$lag2_others <- dplyr::lag(period_state_orders$others, 2)
period_state_orders$diff_others <- abs(period_state_orders$lag2_others - period_state_orders$lag1_others)
period_state_orders1 <- na.omit(period_state_orders)
period_state_orders1$purchase_timestamp_week <- as.factor(period_state_orders1$purchase_timestamp_week)
period_state_orders2<- period_state_orders1[-c(368),]
With the larger set of candidate predictors, stepwise regression is used to select a suitable subset.
# Initial (intercept-only) model
initial_model <- glm.nb(orders ~ 1, data = period_state_orders2)
# Full model: includes all candidate predictors
full_model <- glm.nb(orders ~ purchase_timestamp_week+diff_orders+
diff_morning+ diff_afternoon+ diff_night+ diff_early_morning+
lag1_morning+ lag1_afternoon+ lag1_night+ lag1_early_morning+
diff_SP+ diff_MG+ diff_RJ+ diff_others+
lag1_SP+ lag1_MG+ lag1_RJ+ lag1_others,
# note: the lag1_* state terms must be joined with "+"; a comma before them
# silently passes them as glm.nb's weights argument instead (visible in the
# backward model's Call below)
data = period_state_orders2)
# Forward stepwise selection
forward_model <- step(initial_model, scope = list(lower = initial_model, upper = full_model),
direction = "forward")
## Start: AIC=7023.9
## orders ~ 1
##
## Df Deviance AIC
## + lag1_night 1 228.03 6609.1
## + lag1_afternoon 1 251.40 6632.5
## + lag1_morning 1 336.62 6717.7
## + lag1_early_morning 1 453.22 6834.3
## + diff_others 1 587.05 6968.1
## + diff_night 1 589.88 6970.9
## + diff_orders 1 590.14 6971.2
## + diff_afternoon 1 591.88 6972.9
## + diff_morning 1 592.03 6973.1
## + diff_SP 1 597.74 6978.8
## + diff_MG 1 598.47 6979.5
## + diff_RJ 1 602.59 6983.6
## + purchase_timestamp_week 6 601.16 6992.2
## + diff_early_morning 1 622.43 7003.5
## <none> 644.84 7023.9
##
## Step: AIC=6335.26
## orders ~ lag1_night
##
## Df Deviance AIC
## + lag1_afternoon 1 578.94 6246.9
## + lag1_morning 1 627.25 6295.2
## + lag1_early_morning 1 642.73 6310.7
## + diff_night 1 652.36 6320.4
## + diff_morning 1 658.43 6326.4
## + diff_RJ 1 659.36 6327.4
## + diff_SP 1 660.71 6328.7
## + purchase_timestamp_week 6 651.05 6329.0
## + diff_orders 1 662.42 6330.4
## + diff_MG 1 663.10 6331.1
## + diff_others 1 664.14 6332.1
## <none> 669.26 6335.3
## + diff_early_morning 1 667.44 6335.4
## + diff_afternoon 1 668.61 6336.6
##
## Step: AIC=6238.85
## orders ~ lag1_night + lag1_afternoon
##
## Df Deviance AIC
## + purchase_timestamp_week 6 601.17 6171.5
## + diff_night 1 672.40 6232.8
## <none> 680.48 6238.8
## + lag1_early_morning 1 678.48 6238.9
## + diff_morning 1 678.92 6239.3
## + diff_RJ 1 678.95 6239.3
## + diff_MG 1 679.45 6239.8
## + diff_afternoon 1 679.78 6240.1
## + diff_others 1 679.86 6240.2
## + diff_SP 1 680.03 6240.4
## + lag1_morning 1 680.13 6240.5
## + diff_orders 1 680.32 6240.7
## + diff_early_morning 1 680.37 6240.7
##
## Step: AIC=6164.73
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week
##
## Df Deviance AIC
## + lag1_morning 1 684.14 6154.1
## + diff_night 1 691.74 6161.7
## + lag1_early_morning 1 691.79 6161.8
## <none> 696.72 6164.7
## + diff_morning 1 695.47 6165.5
## + diff_others 1 696.01 6166.0
## + diff_MG 1 696.23 6166.2
## + diff_RJ 1 696.29 6166.3
## + diff_afternoon 1 696.37 6166.4
## + diff_SP 1 696.44 6166.4
## + diff_orders 1 696.46 6166.5
## + diff_early_morning 1 696.51 6166.5
##
## Step: AIC=6153.98
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning
##
## Df Deviance AIC
## + diff_night 1 694.47 6150.6
## + lag1_early_morning 1 696.85 6153.0
## <none> 699.87 6154.0
## + diff_afternoon 1 699.02 6155.1
## + diff_morning 1 699.25 6155.4
## + diff_others 1 699.33 6155.4
## + diff_RJ 1 699.66 6155.8
## + diff_MG 1 699.68 6155.8
## + diff_SP 1 699.75 6155.9
## + diff_orders 1 699.81 6155.9
## + diff_early_morning 1 699.81 6155.9
##
## Step: AIC=6150.55
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night
##
## Df Deviance AIC
## + lag1_early_morning 1 696.88 6149.1
## <none> 700.29 6150.5
## + diff_afternoon 1 699.48 6151.7
## + diff_morning 1 699.84 6152.1
## + diff_orders 1 700.07 6152.3
## + diff_others 1 700.09 6152.3
## + diff_RJ 1 700.25 6152.5
## + diff_MG 1 700.25 6152.5
## + diff_SP 1 700.25 6152.5
## + diff_early_morning 1 700.28 6152.5
##
## Step: AIC=6149.13
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night + lag1_early_morning
##
## Df Deviance AIC
## <none> 700.83 6149.1
## + diff_afternoon 1 699.98 6150.3
## + diff_morning 1 700.14 6150.4
## + diff_others 1 700.38 6150.7
## + diff_MG 1 700.72 6151.0
## + diff_orders 1 700.73 6151.0
## + diff_RJ 1 700.78 6151.1
## + diff_SP 1 700.80 6151.1
## + diff_early_morning 1 700.81 6151.1
summary(forward_model )
##
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night + lag1_early_morning, data = period_state_orders2,
## init.theta = 16.2117696, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.0908933 0.0392028 104.352 < 2e-16 ***
## lag1_night 0.0093107 0.0009775 9.525 < 2e-16 ***
## lag1_afternoon 0.0071557 0.0009763 7.329 2.32e-13 ***
## purchase_timestamp_week2 -0.2847990 0.0429577 -6.630 3.36e-11 ***
## purchase_timestamp_week3 -0.2967043 0.0431882 -6.870 6.42e-12 ***
## purchase_timestamp_week4 -0.2817291 0.0445871 -6.319 2.64e-10 ***
## purchase_timestamp_week5 -0.3382511 0.0451792 -7.487 7.05e-14 ***
## purchase_timestamp_week6 -0.4198728 0.0469256 -8.948 < 2e-16 ***
## purchase_timestamp_week7 -0.1329664 0.0426327 -3.119 0.001815 **
## lag1_morning 0.0037271 0.0011176 3.335 0.000853 ***
## diff_night 0.0028294 0.0011702 2.418 0.015616 *
## lag1_early_morning 0.0059207 0.0031563 1.876 0.060678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(16.2118) family taken to be 1)
##
## Null deviance: 2911.61 on 605 degrees of freedom
## Residual deviance: 700.83 on 594 degrees of freedom
## AIC: 6151.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 16.21
## Std. Err.: 1.17
##
## 2 x log-likelihood: -6125.127
# Backward stepwise selection
backward_model <- step(full_model, direction = "backward")
## Start: AIC=925719.3
## orders ~ purchase_timestamp_week + diff_orders + diff_morning +
## diff_afternoon + diff_night + diff_early_morning + lag1_morning +
## lag1_afternoon + lag1_night + lag1_early_morning + diff_SP +
## diff_MG + diff_RJ + diff_others
##
## Df Deviance AIC
## <none> 102651 925719
## - diff_afternoon 1 102656 925722
## - diff_orders 1 102667 925733
## - diff_RJ 1 102670 925736
## - diff_MG 1 102682 925749
## - diff_early_morning 1 102760 925827
## - diff_others 1 102766 925832
## - diff_SP 1 102804 925870
## - diff_morning 1 103000 926066
## - diff_night 1 103215 926282
## - lag1_early_morning 1 103268 926334
## - lag1_morning 1 107686 930752
## - lag1_afternoon 1 114527 937593
## - lag1_night 1 123273 946339
## - purchase_timestamp_week 6 135548 958605
summary(backward_model)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + diff_orders +
## diff_morning + diff_afternoon + diff_night + diff_early_morning +
## lag1_morning + lag1_afternoon + lag1_night + lag1_early_morning +
## diff_SP + diff_MG + diff_RJ + diff_others, data = period_state_orders2,
## weights = lag1_SP + lag1_MG + lag1_RJ + lag1_others, init.theta = 38.47013631,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.415e+00 2.578e-03 1712.618 < 2e-16 ***
## purchase_timestamp_week2 -3.264e-01 2.683e-03 -121.630 < 2e-16 ***
## purchase_timestamp_week3 -3.080e-01 2.449e-03 -125.794 < 2e-16 ***
## purchase_timestamp_week4 -3.194e-01 2.544e-03 -125.557 < 2e-16 ***
## purchase_timestamp_week5 -3.708e-01 2.617e-03 -141.655 < 2e-16 ***
## purchase_timestamp_week6 -4.554e-01 2.791e-03 -163.138 < 2e-16 ***
## purchase_timestamp_week7 -1.804e-01 2.722e-03 -66.276 < 2e-16 ***
## diff_orders -2.435e-04 6.114e-05 -3.982 6.85e-05 ***
## diff_morning 1.282e-03 6.864e-05 18.674 < 2e-16 ***
## diff_afternoon -1.497e-04 6.795e-05 -2.202 0.0276 *
## diff_night 1.511e-03 6.355e-05 23.780 < 2e-16 ***
## diff_early_morning -1.778e-03 1.701e-04 -10.451 < 2e-16 ***
## lag1_morning 3.955e-03 5.620e-05 70.373 < 2e-16 ***
## lag1_afternoon 5.323e-03 4.900e-05 108.640 < 2e-16 ***
## lag1_night 7.030e-03 4.920e-05 142.890 < 2e-16 ***
## lag1_early_morning 3.961e-03 1.596e-04 24.814 < 2e-16 ***
## diff_SP -9.590e-04 7.808e-05 -12.283 < 2e-16 ***
## diff_MG -7.136e-04 1.277e-04 -5.589 2.29e-08 ***
## diff_RJ -5.882e-04 1.369e-04 -4.296 1.74e-05 ***
## diff_others 9.507e-04 8.903e-05 10.678 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(38.4701) family taken to be 1)
##
## Null deviance: 544468 on 605 degrees of freedom
## Residual deviance: 102651 on 586 degrees of freedom
## AIC: 925721
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 38.470
## Std. Err.: 0.230
##
## 2 x log-likelihood: -925679.316
# Bidirectional stepwise selection
stepwise_model <- step(initial_model, scope = list(lower = initial_model, upper = full_model),
direction = "both")
## Start: AIC=7023.9
## orders ~ 1
##
## Df Deviance AIC
## + lag1_night 1 228.03 6609.1
## + lag1_afternoon 1 251.40 6632.5
## + lag1_morning 1 336.62 6717.7
## + lag1_early_morning 1 453.22 6834.3
## + diff_others 1 587.05 6968.1
## + diff_night 1 589.88 6970.9
## + diff_orders 1 590.14 6971.2
## + diff_afternoon 1 591.88 6972.9
## + diff_morning 1 592.03 6973.1
## + diff_SP 1 597.74 6978.8
## + diff_MG 1 598.47 6979.5
## + diff_RJ 1 602.59 6983.6
## + purchase_timestamp_week 6 601.16 6992.2
## + diff_early_morning 1 622.43 7003.5
## <none> 644.84 7023.9
##
## Step: AIC=6335.26
## orders ~ lag1_night
##
## Df Deviance AIC
## + lag1_afternoon 1 578.94 6246.9
## + lag1_morning 1 627.25 6295.2
## + lag1_early_morning 1 642.73 6310.7
## + diff_night 1 652.36 6320.4
## + diff_morning 1 658.43 6326.4
## + diff_RJ 1 659.36 6327.4
## + diff_SP 1 660.71 6328.7
## + purchase_timestamp_week 6 651.05 6329.0
## + diff_orders 1 662.42 6330.4
## + diff_MG 1 663.10 6331.1
## + diff_others 1 664.14 6332.1
## <none> 669.26 6335.3
## + diff_early_morning 1 667.44 6335.4
## + diff_afternoon 1 668.61 6336.6
## - lag1_night 1 1992.21 7656.2
##
## Step: AIC=6238.85
## orders ~ lag1_night + lag1_afternoon
##
## Df Deviance AIC
## + purchase_timestamp_week 6 601.17 6171.5
## + diff_night 1 672.40 6232.8
## <none> 680.48 6238.8
## + lag1_early_morning 1 678.48 6238.9
## + diff_morning 1 678.92 6239.3
## + diff_RJ 1 678.95 6239.3
## + diff_MG 1 679.45 6239.8
## + diff_afternoon 1 679.78 6240.1
## + diff_others 1 679.86 6240.2
## + diff_SP 1 680.03 6240.4
## + lag1_morning 1 680.13 6240.5
## + diff_orders 1 680.32 6240.7
## + diff_early_morning 1 680.37 6240.7
## - lag1_afternoon 1 788.05 6344.4
## - lag1_night 1 879.16 6435.5
##
## Step: AIC=6164.73
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week
##
## Df Deviance AIC
## + lag1_morning 1 684.14 6154.1
## + diff_night 1 691.74 6161.7
## + lag1_early_morning 1 691.79 6161.8
## <none> 696.72 6164.7
## + diff_morning 1 695.47 6165.5
## + diff_others 1 696.01 6166.0
## + diff_MG 1 696.23 6166.2
## + diff_RJ 1 696.29 6166.3
## + diff_afternoon 1 696.37 6166.4
## + diff_SP 1 696.44 6166.4
## + diff_orders 1 696.46 6166.5
## + diff_early_morning 1 696.51 6166.5
## - purchase_timestamp_week 6 790.59 6246.6
## - lag1_night 1 813.05 6279.1
## - lag1_afternoon 1 891.42 6357.4
##
## Step: AIC=6153.98
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning
##
## Df Deviance AIC
## + diff_night 1 694.47 6150.6
## + lag1_early_morning 1 696.85 6153.0
## <none> 699.87 6154.0
## + diff_afternoon 1 699.02 6155.1
## + diff_morning 1 699.25 6155.4
## + diff_others 1 699.33 6155.4
## + diff_RJ 1 699.66 6155.8
## + diff_MG 1 699.68 6155.8
## + diff_SP 1 699.75 6155.9
## + diff_orders 1 699.81 6155.9
## + diff_early_morning 1 699.81 6155.9
## - lag1_morning 1 712.79 6164.9
## - lag1_afternoon 1 767.35 6219.5
## - lag1_night 1 797.75 6249.9
## - purchase_timestamp_week 6 808.73 6250.8
##
## Step: AIC=6150.55
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night
##
## Df Deviance AIC
## + lag1_early_morning 1 696.88 6149.1
## <none> 700.29 6150.5
## + diff_afternoon 1 699.48 6151.7
## + diff_morning 1 699.84 6152.1
## + diff_orders 1 700.07 6152.3
## + diff_others 1 700.09 6152.3
## + diff_RJ 1 700.25 6152.5
## + diff_MG 1 700.25 6152.5
## + diff_SP 1 700.25 6152.5
## + diff_early_morning 1 700.28 6152.5
## - diff_night 1 705.75 6154.0
## - lag1_morning 1 713.66 6161.9
## - lag1_afternoon 1 761.20 6209.5
## - lag1_night 1 795.45 6243.7
## - purchase_timestamp_week 6 806.13 6244.4
##
## Step: AIC=6149.13
## orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night + lag1_early_morning
##
## Df Deviance AIC
## <none> 700.83 6149.1
## + diff_afternoon 1 699.98 6150.3
## + diff_morning 1 700.14 6150.4
## - lag1_early_morning 1 704.26 6150.6
## + diff_others 1 700.38 6150.7
## + diff_MG 1 700.72 6151.0
## + diff_orders 1 700.73 6151.0
## + diff_RJ 1 700.78 6151.1
## + diff_SP 1 700.80 6151.1
## + diff_early_morning 1 700.81 6151.1
## - diff_night 1 706.67 6153.0
## - lag1_morning 1 712.08 6158.4
## - lag1_afternoon 1 755.00 6201.3
## - lag1_night 1 790.83 6237.1
## - purchase_timestamp_week 6 807.41 6243.7
summary(stepwise_model)
##
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night + lag1_early_morning, data = period_state_orders2,
## init.theta = 16.2117696, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.0908933 0.0392028 104.352 < 2e-16 ***
## lag1_night 0.0093107 0.0009775 9.525 < 2e-16 ***
## lag1_afternoon 0.0071557 0.0009763 7.329 2.32e-13 ***
## purchase_timestamp_week2 -0.2847990 0.0429577 -6.630 3.36e-11 ***
## purchase_timestamp_week3 -0.2967043 0.0431882 -6.870 6.42e-12 ***
## purchase_timestamp_week4 -0.2817291 0.0445871 -6.319 2.64e-10 ***
## purchase_timestamp_week5 -0.3382511 0.0451792 -7.487 7.05e-14 ***
## purchase_timestamp_week6 -0.4198728 0.0469256 -8.948 < 2e-16 ***
## purchase_timestamp_week7 -0.1329664 0.0426327 -3.119 0.001815 **
## lag1_morning 0.0037271 0.0011176 3.335 0.000853 ***
## diff_night 0.0028294 0.0011702 2.418 0.015616 *
## lag1_early_morning 0.0059207 0.0031563 1.876 0.060678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(16.2118) family taken to be 1)
##
## Null deviance: 2911.61 on 605 degrees of freedom
## Residual deviance: 700.83 on 594 degrees of freedom
## AIC: 6151.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 16.21
## Std. Err.: 1.17
##
## 2 x log-likelihood: -6125.127
AIC(forward_model)
## [1] 6151.127
AIC(backward_model)
## [1] 925721.3
AIC(stepwise_model)
## [1] 6151.127
# Compare the models' BIC values
BIC(forward_model)
## [1] 6208.416
BIC(backward_model)
## [1] 925813.9
BIC(stepwise_model)
## [1] 6208.416
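Both criteria can be verified from the printed log-likelihood: AIC = -2·logLik + 2k and BIC = -2·logLik + k·ln(n), with k = 13 estimated parameters (12 regression coefficients plus theta) and n = 606 observations:

```r
# Recompute stepwise_model's AIC and BIC from the printed 2 x log-likelihood (-6125.127):
neg2ll <- 6125.127
k <- 13   # 12 regression coefficients + theta
n <- 606
neg2ll + 2 * k       # AIC: 6151.127
neg2ll + k * log(n)  # BIC: ~6208.416
```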
Comparing AIC and BIC, the forward and bidirectional searches agree, so we build the model from the stepwise_model selection:
orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
lag1_morning + diff_night + lag1_early_morning
(The backward model's extreme AIC/BIC stems from the lag1_* terms entering its call as weights rather than predictors.)
\(y \sim week + lag1\_morning + lag1\_afternoon + lag1\_night + lag1\_early\_morning + diff\_night\)
negbin_model4 <- glm.nb(orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
lag1_morning + diff_night + lag1_early_morning,
data = period_state_orders2)
par(mfrow = c(2, 2))
summary(negbin_model4)
##
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night + lag1_early_morning, data = period_state_orders2,
## init.theta = 16.2117696, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.0908933 0.0392028 104.352 < 2e-16 ***
## lag1_night 0.0093107 0.0009775 9.525 < 2e-16 ***
## lag1_afternoon 0.0071557 0.0009763 7.329 2.32e-13 ***
## purchase_timestamp_week2 -0.2847990 0.0429577 -6.630 3.36e-11 ***
## purchase_timestamp_week3 -0.2967043 0.0431882 -6.870 6.42e-12 ***
## purchase_timestamp_week4 -0.2817291 0.0445871 -6.319 2.64e-10 ***
## purchase_timestamp_week5 -0.3382511 0.0451792 -7.487 7.05e-14 ***
## purchase_timestamp_week6 -0.4198728 0.0469256 -8.948 < 2e-16 ***
## purchase_timestamp_week7 -0.1329664 0.0426327 -3.119 0.001815 **
## lag1_morning 0.0037271 0.0011176 3.335 0.000853 ***
## diff_night 0.0028294 0.0011702 2.418 0.015616 *
## lag1_early_morning 0.0059207 0.0031563 1.876 0.060678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(16.2118) family taken to be 1)
##
## Null deviance: 2911.61 on 605 degrees of freedom
## Residual deviance: 700.83 on 594 degrees of freedom
## AIC: 6151.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 16.21
## Std. Err.: 1.17
##
## 2 x log-likelihood: -6125.127
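A rough deviance-based pseudo-R² can be read off the printed deviances (1 minus residual over null deviance); this is a descriptive measure, not an exact analogue of the OLS R²:

```r
# Pseudo-R^2 for negbin_model4 from the deviances printed above:
1 - 700.83 / 2911.61  # ~0.759
```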
plot(negbin_model4)
Residual diagnostics:
Residuals vs Fitted: residuals scatter around 0 but bow upward, suggesting a remaining trend.
Q-Q plot: the right tail deviates somewhat, so the residuals are not perfectly normal and a few extreme values sit in the right tail.
Scale-Location: the curve bows downward, again suggesting some trend.
Residuals vs Leverage: no isolated high-leverage points or outliers.
Conclusions:
Model fit: some systematic trend remains.
Residual distribution: the right tail deviates, indicating some outliers or extreme values.
Scale-Location: the residual variance is not constant, so some heteroscedasticity is present.
Outliers and high-leverage points: no high-leverage points.
Overall explanatory power improves, but some unmodeled relationship remains.
set.seed(123)
train_index <- createDataPartition(period_state_orders2$orders, p = 0.7, list = FALSE)
train_data <- period_state_orders2[train_index, ]
test_data <- period_state_orders2[-train_index, ]
# Fit on the training data
model <- glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
lag1_morning + diff_night + lag1_early_morning,
data = train_data, link = log)
summary(model)
##
## Call:
## glm.nb(formula = orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
## lag1_morning + diff_night + lag1_early_morning, data = train_data,
## link = log, init.theta = 14.76368879)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.033004 0.048960 82.374 < 2e-16 ***
## lag1_night 0.010284 0.001236 8.323 < 2e-16 ***
## lag1_afternoon 0.007684 0.001247 6.161 7.21e-10 ***
## purchase_timestamp_week2 -0.242079 0.053033 -4.565 5.00e-06 ***
## purchase_timestamp_week3 -0.257723 0.054269 -4.749 2.04e-06 ***
## purchase_timestamp_week4 -0.245121 0.056131 -4.367 1.26e-05 ***
## purchase_timestamp_week5 -0.312144 0.055484 -5.626 1.85e-08 ***
## purchase_timestamp_week6 -0.349786 0.060584 -5.774 7.76e-09 ***
## purchase_timestamp_week7 -0.086067 0.051774 -1.662 0.0964 .
## lag1_morning 0.002721 0.001487 1.830 0.0673 .
## diff_night 0.002468 0.001525 1.619 0.1055
## lag1_early_morning 0.005518 0.003925 1.406 0.1598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(14.7637) family taken to be 1)
##
## Null deviance: 1911.26 on 425 degrees of freedom
## Residual deviance: 496.53 on 414 degrees of freedom
## AIC: 4368.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 14.76
## Std. Err.: 1.27
##
## 2 x log-likelihood: -4342.396
predictions <- predict(model, test_data, type = "response")
# Evaluate model accuracy on the test set
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)
print(paste("MSE: ", mse))
## [1] "MSE: 2548.18776396762"
print(paste("RMSE: ", rmse))
## [1] "RMSE: 50.4795776920491"
print(paste("AIC: ", aic))
## [1] "AIC: 4368.39560058943"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs. Predicted Orders",
x = "Actual Orders",
y = "Predicted Orders") +
theme_minimal()
Cross-validation with k = 10:
# Train the model with k-fold cross-validation
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ lag1_night + lag1_afternoon + purchase_timestamp_week +
lag1_morning + diff_night + lag1_early_morning,
data = period_state_orders2,
method = "glm.nb",
trControl = train_control)
print(model)
## Negative Binomial Generalized Linear Model
##
## 606 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 546, 546, 546, 545, 546, 544, ...
## Resampling results across tuning parameters:
##
## link RMSE Rsquared MAE
## identity NaN NaN NaN
## log 42.21194 0.8080095 28.01818
## sqrt 32.58018 0.8633834 23.63638
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = sqrt.
The fit improves considerably and the slope steepens, though errors grow for larger predicted values.
After cross-validation the average error drops substantially, meaning predictions track actual orders much more closely; R² reaches 80.8% under the log link and 86.3% under the selected sqrt link.
Metric summary (log-link results):
Mean Absolute Error (MAE)
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 28.01818 \]
Root Mean Squared Error (RMSE)
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 42.21194 \]
Coefficient of Determination (R²)
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.8080095 = 80.8\% \]
\(y \sim week + lag1\_morning + \sqrt{lag1\_SP} + \sqrt{lag1\_RJ} + \sqrt{lag1\_others}\)
negbin_model5 <- glm.nb(orders~ purchase_timestamp_week+
lag1_morning +
sqrt(lag1_SP)+ sqrt(lag1_RJ)+ sqrt(lag1_others),
data = period_state_orders2)
summary(negbin_model5)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + lag1_morning +
## sqrt(lag1_SP) + sqrt(lag1_RJ) + sqrt(lag1_others), data = period_state_orders2,
## init.theta = 34.71383173, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.8420912 0.0556051 51.112 < 2e-16 ***
## purchase_timestamp_week2 -0.2490624 0.0306572 -8.124 4.51e-16 ***
## purchase_timestamp_week3 -0.2651266 0.0307252 -8.629 < 2e-16 ***
## purchase_timestamp_week4 -0.2552279 0.0313604 -8.139 4.00e-16 ***
## purchase_timestamp_week5 -0.3222615 0.0313969 -10.264 < 2e-16 ***
## purchase_timestamp_week6 -0.4142629 0.0318938 -12.989 < 2e-16 ***
## purchase_timestamp_week7 -0.1468184 0.0298879 -4.912 9.00e-07 ***
## lag1_morning -0.0073819 0.0009091 -8.120 4.66e-16 ***
## sqrt(lag1_SP) 0.1416348 0.0090050 15.728 < 2e-16 ***
## sqrt(lag1_RJ) 0.1021541 0.0130552 7.825 5.09e-15 ***
## sqrt(lag1_others) 0.1566234 0.0101773 15.390 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(34.7138) family taken to be 1)
##
## Null deviance: 5383.86 on 605 degrees of freedom
## Residual deviance: 729.36 on 595 degrees of freedom
## AIC: 5793.3
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 34.71
## Std. Err.: 2.84
##
## 2 x log-likelihood: -5769.275
The summary shows that every predictor in the model is statistically significant.
par(mfrow = c(2, 2))
plot(negbin_model5)
Residual plots:
Residuals vs Fitted: the residuals scatter around 0, with slightly less trend than model 4.
Q-Q plot: the right tail deviates somewhat, so the residuals are not fully normal and a few unusual values remain in the right tail.
Scale-Location plot: the points scatter around 0, with slightly less trend than model 4.
Residuals vs Leverage plot: no isolated high-leverage points or outliers.
Conclusions:
Model fit: a slight trend remains.
Residual distribution: the right tail deviates, indicating some outliers or extreme values.
Scale-Location: the residual variance is not quite constant, so mild heteroscedasticity remains.
Outliers and high-leverage points: no high-leverage points.
Overall, the model's explanatory power increased, though some residual structure remains.
set.seed(123)
train_index <- createDataPartition(period_state_orders2$orders, p = 0.7, list = FALSE)
train_data <- period_state_orders2[train_index, ]
test_data <- period_state_orders2[-train_index, ]
# train_data
model <- glm.nb(formula = orders ~ purchase_timestamp_week+
lag1_morning +
sqrt(lag1_SP)+ sqrt(lag1_RJ)+ sqrt(lag1_others),
data = train_data, link = log)
summary(model)
##
## Call:
## glm.nb(formula = orders ~ purchase_timestamp_week + lag1_morning +
## sqrt(lag1_SP) + sqrt(lag1_RJ) + sqrt(lag1_others), data = train_data,
## link = log, init.theta = 32.84952412)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.695129 0.069360 38.857 < 2e-16 ***
## purchase_timestamp_week2 -0.204346 0.037082 -5.511 3.58e-08 ***
## purchase_timestamp_week3 -0.237659 0.037707 -6.303 2.92e-10 ***
## purchase_timestamp_week4 -0.208892 0.038622 -5.409 6.35e-08 ***
## purchase_timestamp_week5 -0.288544 0.037982 -7.597 3.03e-14 ***
## purchase_timestamp_week6 -0.360307 0.040071 -8.992 < 2e-16 ***
## purchase_timestamp_week7 -0.100134 0.035257 -2.840 0.00451 **
## lag1_morning -0.009504 0.001162 -8.176 2.93e-16 ***
## sqrt(lag1_SP) 0.159265 0.010776 14.780 < 2e-16 ***
## sqrt(lag1_RJ) 0.116784 0.015995 7.301 2.85e-13 ***
## sqrt(lag1_others) 0.156663 0.012092 12.956 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(32.8495) family taken to be 1)
##
## Null deviance: 3658.30 on 425 degrees of freedom
## Residual deviance: 521.75 on 415 degrees of freedom
## AIC: 4104.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 32.85
## Std. Err.: 3.20
##
## 2 x log-likelihood: -4080.775
predictions <- predict(model, test_data, type = "response")
# Evaluate model accuracy on the test set
mse <- mean((test_data$orders - predictions)^2)
rmse <- sqrt(mse)
aic <- AIC(model)
print(paste("MSE: ", mse))
## [1] "MSE: 930.759732232672"
print(paste("RMSE: ", rmse))
## [1] "RMSE: 30.508355121715"
print(paste("AIC: ", aic))
## [1] "AIC: 4104.77492230407"
test_data$predicted_orders <- predict(model, newdata = test_data, type = "response")
ggplot(test_data, aes(x = orders, y = predicted_orders)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs. Predicted Orders",
x = "Actual Orders",
y = "Predicted Orders") +
theme_minimal()
Cross-validation: k = 10
# Train the model with 10-fold cross-validation
k <- 10
set.seed(123)
train_control <- trainControl(method = "cv", number = k)
model <- train(orders ~ purchase_timestamp_week+
lag1_morning +
sqrt(lag1_SP)+ sqrt(lag1_RJ)+ sqrt(lag1_others),
data = period_state_orders2,
method = "glm.nb",
trControl = train_control)
print(model)
## Negative Binomial Generalized Linear Model
##
## 606 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 546, 546, 546, 545, 546, 544, ...
## Resampling results across tuning parameters:
##
## link RMSE Rsquared MAE
## identity NaN NaN NaN
## log 27.98589 0.8860251 20.97589
## sqrt 26.42287 0.8903867 20.21658
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was link = sqrt.
The fit improves further: some of the remaining trend is removed, and the errors at larger predicted values shrink.
After comparing via cross-validation, the average error drops substantially, meaning the gap between predicted and actual values improves a great deal, and the explanatory power rises to about 88.6%.
Metric summary:
Mean Absolute Error (MAE)
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| = 20.97589 \]
Root Mean Squared Error (RMSE)
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = 27.98589 \]
Coefficient of Determination (R²)
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 0.8860251 = 88.6\% \]
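The MAE/RMSE/R² reported by `print(model)` are averages over the 10 folds; caret keeps the per-fold values in the fitted `train` object, which can be inspected directly (a sketch against the `model` object from the `train()` call above):

```r
# Per-fold resampling metrics stored by caret::train()
model$resample                                             # one row per fold: RMSE, Rsquared, MAE
colMeans(model$resample[, c("RMSE", "Rsquared", "MAE")])   # fold averages for the selected link
```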
Removing offshore points (outliers)
# Remove the outlier coordinates
a<- customer_sellers_zip1[which(customer_sellers_zip1$customer_lng>-41 & customer_sellers_zip1$customer_lat>-2),]
a_sf <- st_as_sf(a, coords = c("customer_lng", "customer_lat"), crs = 4326)
# Map the outlier locations
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = a_sf, size = 2,color = "red") +
theme_minimal() +
labs(color = "Location Type")
customer_sellers_zip1 <- customer_sellers_zip1[-which(customer_sellers_zip1$customer_lng>-41 & customer_sellers_zip1$customer_lat>-2),]
library(geosphere)
library(maps)               # world map data
library(rnaturalearth)
library(rnaturalearthhires) # ne_states()
library(sf)                 # simple features
brazil <- ne_countries(country = "Brazil", returnclass = "sf")
brazil_states <- ne_states(country = "Brazil", returnclass = "sf")
customer_geo <- data.frame(
customer_id = customer_sellers_zip1$customer_id,
customer_lat = customer_sellers_zip1$customer_lat,
customer_lng = customer_sellers_zip1$customer_lng
)
customer_geo <- customer_geo[1:3, ] # the full ~100k rows take too long to plot; a small subset is used for demonstration
sellers_geo <- data.frame(
seller_id = customer_sellers_zip1$seller_id,
seller_lat = customer_sellers_zip1$seller_lat,
seller_lng = customer_sellers_zip1$seller_lng
)
sellers_geo <-sellers_geo[1:3,]
a <- data.frame(
customer_lat = customer_sellers_zip1$customer_lat,
seller_lat = customer_sellers_zip1$seller_lat
)
b <- data.frame(
customer_lng = customer_sellers_zip1$customer_lng,
seller_lng = customer_sellers_zip1$seller_lng
)
customer_sellers_meanzip<- customer_sellers_zip1 %>% mutate(
mean_lat = apply(a , 1, mean),
mean_lng = apply(b , 1, mean)
)
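The midpoint above is the plain arithmetic mean of latitude and longitude, i.e. a planar approximation. At Brazil-scale distances the difference is small, but a great-circle midpoint can be computed with `geosphere::midPoint()` if more accuracy is wanted (a sketch; the `gc_lng`/`gc_lat` column names are illustrative):

```r
library(geosphere)
# midPoint() expects (lon, lat) pairs and returns the great-circle midpoint
gc_mid <- midPoint(
  cbind(customer_sellers_zip1$customer_lng, customer_sellers_zip1$customer_lat),
  cbind(customer_sellers_zip1$seller_lng,   customer_sellers_zip1$seller_lat)
)
customer_sellers_meanzip$gc_lng <- gc_mid[, 1]
customer_sellers_meanzip$gc_lat <- gc_mid[, 2]
```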
mean_geo <- data.frame(
mean_lat = customer_sellers_meanzip$mean_lat,
mean_lng = customer_sellers_meanzip$mean_lng
)
mean_geo <-mean_geo[1:3,]
#
customer_geo_sf <- st_as_sf(customer_geo, coords = c("customer_lng", "customer_lat"), crs = 4326)
sellers_geo_sf <- st_as_sf(sellers_geo, coords = c("seller_lng", "seller_lat"), crs = 4326)
mean_geo_sf <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
# Compute the great-circle route for each customer-seller pair
routes <- data.frame()
for (i in 1:nrow(customer_geo)) {
route <- gcIntermediate(c(customer_geo$customer_lng[i], customer_geo$customer_lat[i]),
c(sellers_geo$seller_lng[i], sellers_geo$seller_lat[i]),
n = 50, addStartEnd = TRUE, sp = TRUE)
route <- st_as_sf(route)
route$customer_id <- customer_geo$customer_id[i]
routes <- rbind(routes, route)
}
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = customer_geo_sf, aes(color = "customers"), size = 2) +
geom_sf(data = sellers_geo_sf , aes(color = "Sellers"), size = 2) +
geom_sf(data = routes, color = "blue") +
scale_color_manual(values = c("customers" = "red", "Sellers" = "green")) +
theme_minimal() +
labs(title = "The distance between the customers and the sellers",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Map with the midpoints added
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = customer_geo_sf, aes(color = "customers"), size = 2) +
geom_sf(data = sellers_geo_sf , aes(color = "Sellers"), size = 2) +
geom_sf(data = mean_geo_sf , aes(color = "Midpoint"), size = 2) +
geom_sf(data = routes, color = "blue") +
scale_color_manual(values = c("customers" = "red", "Sellers" = "green","Midpoint"="orange")) +
theme_minimal() +
labs(title = "The distance between the customers and the sellers",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Plot the midpoint locations
mean_geo <- data.frame(
mean_lat = customer_sellers_meanzip$mean_lat,
mean_lng = customer_sellers_meanzip$mean_lng
)
mean_geo_sf <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = mean_geo_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Midpoint of position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# kmeans k=1
k <- 1
kmeans_result1 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result1$cluster)
mean_geo_sf1 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df1 <- as.data.frame(kmeans_result1$centers)
centers_df1$cluster <- factor(1:nrow(centers_df1)) # add cluster labels
unique(centers_df1)
## mean_lat mean_lng cluster
## 1 -22.02344 -46.7283 1
#k=1
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = mean_geo_sf1, aes(color = cluster), size = 1, color = "orange") +
geom_point(data = as.data.frame(kmeans_result1$centers), aes(x = mean_lng, y = mean_lat),
color = 'red', size = 5, shape = 4) +
geom_text(data = centers_df1,
aes(x = mean_lng, y = mean_lat,
label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")", sep = ""))) +
labs(title = paste("K-means cluster (K=", k, ")", sep=""),
x = "Longitude", y = "Latitude") +
labs(color = "cluster")+
theme_minimal()
k <- 3
set.seed(12)
kmeans_result3 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result3$cluster)
mean_geo_sf3 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df3 <- as.data.frame(kmeans_result3$centers)
centers_df3$cluster <- factor(1:nrow(centers_df3)) # add cluster labels
unique(centers_df3)
## mean_lat mean_lng cluster
## 1 -23.68930 -48.09046 1
## 2 -21.60400 -45.52332 2
## 3 -14.90165 -43.49890 3
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = mean_geo_sf3, aes(color = cluster), size = 1) +
geom_point(data = as.data.frame(kmeans_result3$centers), aes(x = mean_lng, y = mean_lat),
color = 'red', size = 5, shape = 4) +
geom_text(data = centers_df3,
aes(x = mean_lng, y = mean_lat,label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")", sep = "")),
vjust = -1, hjust = 0.5, color = "black", size = 3) +
labs(title = paste("K-means cluster (K=", k, ")", sep=""), x = "Longitude", y = "Latitude") +
theme_minimal()
Choosing a suitable number of clusters by observing the WCSS (Within-Cluster Sum of Squares)
\[
W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2, \quad \mu_k = E(x_i)
\text{ where } x_i \in C_k
\]
\[ \text{minimize}(\text{tot.withinss}) = \text{minimize} \left( \sum_{k=1}^{K} W(C_k) \right) = \text{minimize} \left( \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \right) \]
Idea:
\(\qquad\) Plot the number of clusters against the total within-cluster sum of squares (WCSS) and observe how WCSS changes with the number of clusters.
\(\qquad\) \(\text{tot.withinss}\) measures the compactness of the clustering; we want it as small as possible.
Method:
\(\qquad\) Step 1: Run the clustering algorithm (e.g. k-means) for different values of k, for example varying k from 1 to 10 clusters.
\(\qquad\) Step 2: For each k, compute the total within-cluster sum of squares (WCSS).
\(\qquad\) Step 3: Plot the WCSS curve against the number of clusters k.
\(\qquad\) Step 4: The location of the bend (the "elbow") in the plot is generally taken as an indicator of the appropriate number of clusters.
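The \(W(C_k)\) defined above is exactly what `kmeans()` returns as `withinss` (and its sum over clusters as `tot.withinss`); a self-contained base-R check on toy data:

```r
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)               # 20 toy points in 2-D
km  <- kmeans(toy, centers = 2, nstart = 10)
# Recompute W(C_k) by hand: squared deviations from each cluster's mean
wcss <- sapply(1:2, function(k) {
  pts <- toy[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)
})
all.equal(sum(wcss), km$tot.withinss)            # TRUE
```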
# Use the elbow method to choose a suitable K
wss <- (nrow(mean_geo)-1)*sum(apply(mean_geo[,1:2], 2, var))
for (i in 2:30) {
kmeans_result <- kmeans(mean_geo[, 1:2], centers = i, nstart = 50, iter.max = 100)
wss[i] <- sum(kmeans_result$withinss)
}
k <- 30
wss_df <- tibble(clusters = 1:k, wss = wss)
scree_plot <- ggplot(wss_df, aes(x = clusters, y = wss, group = 1)) +
geom_point(size = 3)+
geom_line() +
scale_x_continuous(breaks = c(1:30)) +
xlab("Number of clusters")+
ylab("Within-Cluster Sum of Squares")+
labs(title = "Within-Cluster Sum of Squares under different K values")
scree_plot +geom_hline(
yintercept = wss,
linetype = 'dashed',
col = c(rep('#000000',4),'#FF0000', rep('#000000', 25))
)
When the drop in WCSS per additional k slows down sharply, that K is taken as the optimal number of clusters.
As the plot shows, after k = 5 the WCSS declines much more slowly, making it a reasonable choice; we therefore choose K = 5.
Average Silhouette Method (Silhouette Coefficient)
Evaluating the quality of a clustering
Idea:
\(\qquad\) 1. Assess how well each observation sits within its cluster; a higher silhouette coefficient indicates a good clustering.
\(\qquad\) 2. The average silhouette method computes the mean silhouette of the observations for different values of k.
\(\qquad\) 3. The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate k values.
\(\qquad\) 4. The silhouette coefficient \(s(i)\) is defined as:
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad s(i) \in [-1, 1]
\]
\(\qquad\) \(a(i)\) = the average distance from observation i to the other points in its own cluster (internal distance)
\(\qquad\) \(b(i)\) = the average distance from observation i to the points of the nearest other cluster (external distance)
\(\qquad\) \(s(i)\) close to 1: the points within the cluster are very similar
\(\qquad\) \(s(i)\) close to 0: the observation lies right on the boundary between two clusters
\(\qquad\) \(s(i)\) close to -1: the clustering is poor and the observation is probably assigned to the wrong cluster
Method:
\(\qquad\) Step 1: Run the clustering algorithm (e.g. k-means) for different values of k, for example varying k from 1 to 10 clusters.
\(\qquad\) Step 2: For each k, compute the average silhouette of the observations (average \(s(i)\)).
\(\qquad\) Step 3: Plot the average silhouette curve against the number of clusters k.
\(\qquad\) Step 4: The location of the maximum is taken as the appropriate number of clusters.
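These steps are implemented by `cluster::silhouette()` (which `fviz_nbclust(..., method = "silhouette")` uses internally); a hedged sketch on a further subsample of the `sampled_df` object from the sampling section below, since `dist()` on 10,000 points is memory-hungry:

```r
library(cluster)
set.seed(1)
sub <- sampled_df[sample(nrow(sampled_df), 1000), ]  # subsample: dist() scales quadratically
km  <- kmeans(sub, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(sub))
mean(sil[, "sil_width"])  # average s(i); repeat over k and pick the maximizing k
```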
Determining the optimal number of clusters by comparing the clustering of the observed data with that of randomly generated data
Idea:
\(\qquad\) 1. If the observed data form clearer clusters than randomly generated data, the total within-cluster sum of squares of the observed data (\(w_k\)) should be markedly lower than that of the random data. The larger the gap, the more pronounced the clustering.
\(\qquad\) 2. Gap statistic: compare the total within-cluster variation for different values of k against its expected value under a reference distribution, to judge whether a given number of clusters does better than randomly generated data.
\[\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(w_{kb}^*) - \log(w_k)\]
\(\qquad\) Generating the reference data: Monte Carlo simulation; for all \(x_i \in D\), where D is the data set
\(\qquad\) Step 1: Compute the data range \([\min(x_i), \max(x_j)]\)
\(\qquad\) Step 2: Draw n observations from a uniform distribution: \(b_k \sim U(\min(x_i), \max(x_j)), \hspace{0.5em} k = 1, 2, \ldots, n\)
Method:
\(\qquad\) Step 1: Cluster the observed data, varying the number of clusters \(k = 1, \ldots, k_{max}\), and compute the within-cluster sum of squares \(w_k\).
\(\qquad\) Step 2: Generate B reference data sets from a random uniform distribution.
\(\qquad\) Step 3: Cluster each reference data set with \(k = 1, \ldots, k_{max}\) clusters and compute the within-cluster sums of squares \(w_{kb}^*\).
\(\qquad\) Step 4: Compute the gap statistic
\[\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(w_{kb}^*) - \log(w_k)\]
\[E[\log(w_{kb}^*)]: \text{mean } \log \text{ total within-cluster sum of squares over the random data sets}, \quad \log(w_k): \log \text{ total within-cluster sum of squares of the observed data}\]
\(\qquad\) Step 5: Compute the standard deviation \(s_k\); letting \(\bar{w} = \frac{1}{B} \sum_{b} \log(w_{kb}^*),\)
\[\text{sd}_{k} = \sqrt{\frac{1}{B} \sum_{b} (\log(w_{kb}^*) - \bar{w})^2}, \quad s_k = \text{sd}_k \times \sqrt{1 + \frac{1}{B}}\]
\(\qquad\) Step 6: Choose the smallest number of clusters \(k\) such that the gap statistic at \(k\) is within one standard error of the gap at \(k+1\): \[\text{Gap}(k) \geq \text{Gap}(k+1) - s_{k+1}\]
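Steps 1–6 are implemented by `cluster::clusGap()`, which `fviz_nbclust(..., method = "gap_stat")` calls under the hood; a hedged sketch on a small subset of `mean_geo`:

```r
library(cluster)
set.seed(1)
sub <- mean_geo[sample(nrow(mean_geo), 1000), 1:2]   # subsample to keep B = 50 bootstraps fast
gap <- clusGap(sub, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# gap$Tab columns: logW, E.logW, gap, SE.sim (the s_k above);
# maxSE() applies the one-standard-error rule from Step 6
with(as.data.frame(gap$Tab), maxSE(gap, SE.sim, method = "Tibs2001SEmax"))
```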
##### Sample 1 #####
library(factoextra)
set.seed(121) # for reproducibility
df <- mean_geo[, 1:2]
sampled_df <- df[sample(nrow(df), 10000), ] # randomly sample 10,000 observations
sampled_df_sf <- st_as_sf(sampled_df, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p2_30 <- fviz_nbclust(sampled_df, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p3_30 <- fviz_nbclust(sampled_df, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(121)
p4_30 <- fviz_nbclust(sampled_df, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p2_30
p3_30
p4_30
##### Sample 2 #####
set.seed(122) # for reproducibility
df2 <- mean_geo[, 1:2]
sampled_df2 <- df2[sample(nrow(df2), 10000), ] # randomly sample 10,000 observations
sampled_df2_sf <- st_as_sf(sampled_df2, coords = c("mean_lng", "mean_lat"), crs = 4326)
p12 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df2_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p22_30 <- fviz_nbclust(sampled_df2, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p32_30 <- fviz_nbclust(sampled_df2, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(122)
p42_30 <- fviz_nbclust(sampled_df2, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p12
p22_30
p32_30
p42_30
##### Sample 3 #####
set.seed(123) # for reproducibility
df3 <- mean_geo[, 1:2]
sampled_df3 <- df3[sample(nrow(df3), 10000), ] # randomly sample 10,000 observations
sampled_df3_sf <- st_as_sf(sampled_df3, coords = c("mean_lng", "mean_lat"), crs = 4326)
p13 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df3_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p23_30 <- fviz_nbclust(sampled_df3, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p33_30 <- fviz_nbclust(sampled_df3, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
# Gap statistic
set.seed(123)
p43_30 <- fviz_nbclust(sampled_df3, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p13
p23_30
p33_30
p43_30
##### Sample 4 #####
set.seed(124) # for reproducibility
df4 <- mean_geo[, 1:2]
sampled_df4 <- df4[sample(nrow(df4), 10000), ] # randomly sample 10,000 observations
sampled_df4_sf <- st_as_sf(sampled_df4, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df4_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p24_30 <- fviz_nbclust(sampled_df4, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p34_30 <- fviz_nbclust(sampled_df4, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(124)
p44_30 <- fviz_nbclust(sampled_df4, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p24_30
p34_30
p44_30
##### Sample 5 #####
set.seed(125) # for reproducibility
df5 <- mean_geo[, 1:2]
sampled_df5 <- df5[sample(nrow(df5), 10000), ] # randomly sample 10,000 observations
sampled_df5_sf <- st_as_sf(sampled_df5, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df5_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p25_30 <- fviz_nbclust(sampled_df5, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p35_30 <- fviz_nbclust(sampled_df5, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(125)
p45_30 <- fviz_nbclust(sampled_df5, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p25_30
p35_30
p45_30
##### Sample 6 #####
set.seed(126) # for reproducibility
df6 <- mean_geo[, 1:2]
sampled_df6 <- df6[sample(nrow(df6), 10000), ] # randomly sample 10,000 observations
sampled_df6_sf <- st_as_sf(sampled_df6, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df6_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p26_30 <- fviz_nbclust(sampled_df6, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p36_30 <- fviz_nbclust(sampled_df6, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(126)
p46_30 <- fviz_nbclust(sampled_df6, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p26_30
p36_30
p46_30
##### Sample 7 #####
set.seed(127) # for reproducibility
df7 <- mean_geo[, 1:2]
sampled_df7 <- df7[sample(nrow(df7), 10000), ] # randomly sample 10,000 observations
sampled_df7_sf <- st_as_sf(sampled_df7, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df7_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p27_30 <- fviz_nbclust(sampled_df7, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p37_30 <- fviz_nbclust(sampled_df7, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(127)
p47_30 <- fviz_nbclust(sampled_df7, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p27_30
p37_30
p47_30
##### Sample 8 #####
set.seed(128) # for reproducibility
df8 <- mean_geo[, 1:2]
sampled_df8 <- df8[sample(nrow(df8), 10000), ] # randomly sample 10,000 observations
sampled_df8_sf <- st_as_sf(sampled_df8, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df8_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p28_30 <- fviz_nbclust(sampled_df8, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p38_30 <- fviz_nbclust(sampled_df8, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(128)
p48_30 <- fviz_nbclust(sampled_df8, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p28_30
p38_30
p48_30
##### Sample 9 #####
set.seed(129) # for reproducibility
df9 <- mean_geo[, 1:2]
sampled_df9 <- df9[sample(nrow(df9), 10000), ] # randomly sample 10,000 observations
sampled_df9_sf <- st_as_sf(sampled_df9, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df9_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p29_30 <- fviz_nbclust(sampled_df9, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p39_30 <- fviz_nbclust(sampled_df9, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(129)
p49_30 <- fviz_nbclust(sampled_df9, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p29_30
p39_30
p49_30
##### Sample 10 #####
set.seed(1210) # for reproducibility
df10 <- mean_geo[, 1:2]
sampled_df10 <- df10[sample(nrow(df10), 10000), ] # randomly sample 10,000 observations
sampled_df10_sf <- st_as_sf(sampled_df10, coords = c("mean_lng", "mean_lat"), crs = 4326)
p1 <- ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = sampled_df10_sf , aes(color = "Midpoint"), size = 1) +
scale_color_manual(values = c("Midpoint"="orange")) +
theme_minimal() +
labs(title = "Sampling position",
x = "Longitude", y = "Latitude",
color = "Location Type")
# Elbow method
p210_30 <- fviz_nbclust(sampled_df10, kmeans, method = "wss", k.max = 30) +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
p310_30 <- fviz_nbclust(sampled_df10, kmeans, method = "silhouette", k.max = 30)+
labs(subtitle = "Silhouette method")
set.seed(1210)
p410_30 <- fviz_nbclust(sampled_df10, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 30)+
labs(subtitle = "Gap statistic method")
p1
p210_30
p310_30
p410_30
Take the top 20 K values by average gap and mark them with lines shaded by rank
##### Average the Gap statistic over the 10 samples and plot #####
total_gap <- data.frame(p4_30$data$gap ,p42_30$data$gap,
p43_30$data$gap, p44_30$data$gap,
p45_30$data$gap, p46_30$data$gap,
p47_30$data$gap, p48_30$data$gap,
p49_30$data$gap, p410_30$data$gap)
total_gap <- total_gap %>% mutate(
mean_gap = apply(.,1,mean),
sd_gap = apply(.,1,sd),
k = 1:nrow(total_gap)
)
top_n <- 20
top_mean_gap_indices <- order(total_gap$mean_gap, decreasing = TRUE)[1:top_n]
colors <- colorspace::sequential_hcl(n = top_n, palette = "Heat 2")
ggplot(total_gap, aes(x = k, y = mean_gap)) +
geom_line() +
geom_point() +
lapply(seq_along(top_mean_gap_indices), function(i) {
geom_vline(xintercept = total_gap$k[top_mean_gap_indices[i]],
linetype = "dashed", color = colors[i], size = 1.1)
}) +
scale_x_continuous(breaks = c(1:30)) +
labs(title = "Gap Statistic Method",
subtitle = "Average Gap Statistics over 10 runs",
x = "Number of clusters (K)",
y = "Gap Statistic") +
theme_minimal()
Based on the 10 random samples:
From the ~100,000 records, 10,000 observations were randomly drawn each time and the tests above were run.
1. Elbow method: the WCSS drops sharply up to k = 4 and levels off afterwards.
2. Silhouette coefficient: k = 2 scores the best clustering quality; for larger k the index stabilizes.
3. Gap statistic: 50 random reference sets were generated, and the gap statistics from the 10 samples were averaged for each number of clusters and re-plotted.
\(\quad\) Ranked by the gap statistic:
Based on the gap statistic and practical considerations, the suggested order of expansion is:
k = 1 \(\rightarrow\) k = 3 \(\rightarrow\) k = 5 \(\rightarrow\) k = 14
# kmeans k=5
k <- 5
set.seed(12)
kmeans_result5 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result5$cluster)
mean_geo_sf5 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df5 <- as.data.frame(kmeans_result5$centers)
centers_df5$cluster <- factor(1:nrow(centers_df5)) # add cluster labels
unique(centers_df5)
## mean_lat mean_lng cluster
## 1 -25.77325 -49.60844 1
## 2 -23.17350 -47.28295 2
## 3 -21.88553 -44.90911 3
## 4 -14.90212 -42.89579 4
## 5 -19.03393 -49.18551 5
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = mean_geo_sf5, aes(color = cluster), size = 1) +
geom_point(data = as.data.frame(kmeans_result5$centers), aes(x = mean_lng, y = mean_lat),
color = 'red', size = 5, shape = 4) +
geom_text(data = centers_df5,
aes(x = mean_lng, y = mean_lat,
label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")",
sep = "")),
vjust = -1, hjust = 0.5, color = "black", size = 3) +
labs(title = paste("K-means cluster (K=", k, ")", sep=""), x = "Longitude", y = "Latitude") +
theme_minimal()
# kmeans #k=14
set.seed(12)
k <- 14
kmeans_result14 <- kmeans(mean_geo[,1:2], centers=k)
mean_geo$cluster <- as.factor(kmeans_result14$cluster)
mean_geo_sf14 <- st_as_sf(mean_geo, coords = c("mean_lng", "mean_lat"), crs = 4326)
centers_df14 <- as.data.frame(kmeans_result14$centers)
centers_df14$cluster <- factor(1:nrow(centers_df14)) # add cluster labels
unique(centers_df14)
## mean_lat mean_lng cluster
## 1 -24.89890 -47.93623 1
## 2 -21.42944 -43.60574 2
## 3 -17.51547 -42.91844 3
## 4 -21.50916 -45.68617 4
## 5 -19.36181 -47.72873 5
## 6 -12.90147 -46.64863 6
## 7 -13.94465 -41.22025 7
## 8 -27.31003 -51.30935 8
## 9 -23.15399 -49.97409 9
## 10 -22.32487 -47.83565 10
## 11 -17.04051 -53.11055 11
## 12 -23.01799 -44.95821 12
## 13 -26.18626 -49.37412 13
## 14 -23.44512 -46.66692 14
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_sf(data = mean_geo_sf14, aes(color = cluster), size = 1) +
geom_point(data = as.data.frame(kmeans_result14$centers),
aes(x = mean_lng, y = mean_lat),
color = 'red', size = 5, shape = 4) +
geom_text(data = centers_df14,
aes(x = mean_lng, y = mean_lat,
label = paste("(",round(mean_lng, 2),",", round(mean_lat, 2), ")",
sep = "")),
vjust = ifelse(centers_df14$cluster %in%c(1,10,13,14),1.2,-1),
hjust = ifelse(centers_df14$cluster %in%c(1,2,10,12,13,14),-0.3,0.5),
color = "black", size = 3) +
labs(title = paste("K-means cluster (K=", k, ")", sep=""), x = "Longitude", y = "Latitude") +
theme_minimal()
As the number of centers K grows step by step from 1 to 5, the new centers overlap with or sit near the existing locations, so the network can be expanded gradually from those points.
# k=1 & 3 & 5
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_point(data = as.data.frame(kmeans_result1$centers),
aes(x = mean_lng, y = mean_lat, color = "cluster 1"), size = 5, shape = 16) +
geom_point(data = as.data.frame(kmeans_result3$centers),
aes(x = mean_lng, y = mean_lat, color = "cluster 3"), size = 3, shape = 17)+
geom_point(data = as.data.frame(kmeans_result5$centers),
aes(x = mean_lng, y = mean_lat, color = "cluster 5"), size = 3, shape = 15)+
scale_color_manual(values = c("cluster 1" = "red", "cluster 3" ="#FF3399","cluster 5"="#FF9933"),
breaks = c("cluster 1", "cluster 3", "cluster 5")) +
labs(title = paste("K-means cluster in different K", sep=""),
x = "Longitude", y = "Latitude",color = "number of K") +
theme_minimal()
When K grows from 5 to 14 centers, we observe:
The cluster-1 starting point is still one of the overlapping centers after splitting into 14 groups, so it can be regarded as the initial warehousing and logistics hub.
Most of the original centers split into two or three nearby points, i.e. the original locations are not necessarily the optimal center sites.
# k=1 & 3 & 5 & 14
ggplot() +
geom_sf(data = brazil_states, fill = "white", color = "black") +
geom_point(data = as.data.frame(kmeans_result1$centers),
aes(x = mean_lng, y = mean_lat, color = "cluster 1"), size = 5, shape = 16) +
geom_point(data = as.data.frame(kmeans_result3$centers),
aes(x = mean_lng, y = mean_lat, color = "cluster 3"), size = 3, shape = 17)+
geom_point(data = as.data.frame(kmeans_result5$centers),
aes(x = mean_lng, y = mean_lat, color = "cluster 5"), size = 3, shape = 15)+
geom_point(data = as.data.frame(kmeans_result14$centers),
aes(x = mean_lng, y = mean_lat, color = "cluster 14"), size = 3, shape = 8, stroke = 1.1)+
scale_color_manual(values = c("cluster 1" = "red", "cluster 3" ="#FF3399",
"cluster 5"="#FF9933","cluster 14"="#66CC00"),
breaks = c("cluster 1", "cluster 3", "cluster 5", "cluster 14")) +
labs(title = paste("K-means cluster in different K", sep=""),
x = "Longitude", y = "Latitude", color = "number of K") +
theme_minimal()